ML Lab Manual
—--------------------------------------------------------------------------------------
Group B
Assignment No : 7
—---------------------------------------------------------------------------------------
Title of the Assignment: Predict the price of the Uber ride from a given pickup
point to the agreed drop-off location. Perform following tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
Dataset Description: The project is about the world's largest taxi company, Uber Inc. In this
project, we are looking to predict the fare for their future transactional cases. Uber delivers
its service to lakhs of customers daily, so it becomes really important to manage their data
properly to come up with new business ideas and get the best results. Eventually, it becomes
really important to estimate the fare prices accurately.
Prerequisite:
1. Basic knowledge of Python
2. Concept of preprocessing data
3. Basic knowledge of Data Science and Big Data Analytics.
1. Data Preprocessing
2. Linear regression
3. Random forest regression models
4. Box Plot
5. Outliers
6. Haversine
7. Matplotlib
8. Mean Squared Error
Data Preprocessing:
Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, we do not always come across clean and formatted data,
and while doing any operation with data, it is mandatory to clean it and put it in a formatted
way. For this, we use the data preprocessing task.
Real-world data generally contains noise and missing values, and may be in an unusable format
which cannot be directly used for machine learning models. Data preprocessing is a required task
for cleaning the data and making it suitable for a machine learning model, which also increases
the accuracy and efficiency of the model.
●Importing libraries
●Importing datasets
●Feature scaling
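A minimal sketch of these steps for this assignment is shown below; the file name uber.csv and the column names (fare_amount, pickup_datetime, and so on) are assumptions based on the usual Kaggle version of the Uber fares dataset and should be checked against your copy.

import pandas as pd

# Load the Uber fares data (file name assumed)
df = pd.read_csv("uber.csv")

# Drop index-like columns that carry no predictive information
df = df.drop(columns=["Unnamed: 0", "key"], errors="ignore")

# Parse the pickup timestamp and remove rows with missing values
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"], errors="coerce")
df = df.dropna()

# Keep only plausible fares
df = df[df["fare_amount"] > 0]
print(df.shape)
print(df.dtypes)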
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and
one or more independent (x) variables, hence it is called linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent variable changes
according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between
the variables. Consider the figure below:
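A hedged scikit-learn sketch of fitting a linear regression for this assignment follows; the feature and target column names are assumptions and should be replaced with the features you actually engineer (for example, trip distance).

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Assumed feature/target split from the cleaned dataframe above
X = df[["passenger_count"]]        # placeholder feature set; add trip distance etc.
y = df["fare_amount"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)           # learn the slope(s) and intercept
y_pred = lr.predict(X_test)        # predictions on unseen data
print("Linear regression R2:", lr.score(X_test, y_test))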
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both classification and regression problems in ML. It is based on
the concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.
A greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
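As a hedged sketch, a random forest regressor can be trained on the same train/test split used in the linear regression example above and compared against it.

from sklearn.ensemble import RandomForestRegressor

# 100 trees is a common starting point; n_estimators can be tuned later
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print("Random forest R2:", rf.score(X_test, y_test))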
Boxplot:
Boxplots are a measure of how well data is distributed across a data set. The graph represents
the minimum, maximum, median, first quartile, and third quartile of the data set. A boxplot is
also useful for comparing the distribution of data across data sets by drawing a boxplot for each
of them.
R provides a boxplot() function to create a boxplot. Its main arguments are:
● x: a vector or a formula.
● varwidth: a logical value; if set to true, the widths of the boxes are drawn proportionate to
the sample sizes.
● names: the group labels that will be printed under each boxplot.
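Since this lab is implemented in Python rather than R, the same idea can be sketched with pandas and matplotlib; fare_amount is an assumed column name from the Uber dataset loaded earlier.

import matplotlib.pyplot as plt

# A boxplot of the fare column makes extreme fares (outliers) visible
df.boxplot(column="fare_amount")
plt.title("Boxplot of fare_amount")
plt.show()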
Outliers:
As the name suggests, "outliers" refer to data points that exist outside of what is to be
expected. The major question about outliers is what you do with them. Whenever you analyze a data
set, you will have some assumptions about how that data was generated. If you find data points
that are likely to contain some form of error, then these are definitely outliers, and depending
on the context, you will want to deal with those errors. The data mining process involves the
analysis and prediction of the information that the data holds. In 1969, Grubbs introduced the
first definition of outliers.
Global Outliers
Global outliers are also called point outliers. Global outliers are the simplest form of outliers.
When a data point deviates from all the rest of the data points in a given data set, it is known
as a global outlier. In most cases, outlier detection procedures are targeted at determining
global outliers. In the figure, the green data point is the global outlier.
Collective Outliers
In a given set of data, when a group of data points deviates from the rest of the data set, it is
called a collective outlier. Here, the individual data objects may not be outliers, but when you
consider the group of data objects as a whole, they may behave as outliers. To identify these
types of outliers, you usually need background information about the relationship between the
behavior shown by the different data objects. For example, in an Intrusion Detection System, a
DoS (denial-of-service) packet sent from one system to another may be taken as normal behavior;
however, if this happens with various computers simultaneously, it is considered abnormal
behavior, and as a whole the packets are called collective outliers. In the figure, the green
data points as a whole represent the collective outlier.
As the name suggests, "Contextual" means this outlier introduced within a context. For example, in the
speech recognition technique, the single background noise. Contextual outliers are also known as
Conditional outliers. These types of outliers happen if a data object deviates from the other data points
because of any speci c condition in a given data set. As we know, there are two types of attributes of
objects of data: contextual attributes and behavioral attributes. Contextual outlier analysis enables the
users to examine outliers in different contexts and conditions, which can be useful in various applications.
For example, A temperature reading of 45 degrees Celsius may behave as an outlier in a rainy season.
Still, it will behave like a normal data point in the context of a summer season. In thegiven diagram, a
green dot representing the low-temperature value in June is a contextual outlier since the same value in
December is not an outlier.
Haversine:
The Haversine formula calculates the shortest distance between two points on a
sphere using their latitudes and longitudes measured along the surface. It is
important for use in navigation.
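A small helper implementing this formula, commonly used in this assignment to turn pickup/drop-off coordinates into a trip distance, might look like the sketch below; the coordinate column names in the commented usage are assumptions from the Uber dataset.

import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))   # 6371 km = mean Earth radius

# Example usage on the Uber data (column names assumed):
# df["distance_km"] = haversine(df["pickup_longitude"], df["pickup_latitude"],
#                               df["dropoff_longitude"], df["dropoff_latitude"])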
Matplotlib:
Matplotlib is a Python library used for creating static, animated, and interactive
visualizations, such as the scatter plots and box plots used in this assignment.
Mean Squared Error:
The Mean Squared Error (MSE) or Mean Squared Deviation (MSD) of an estimator measures the average
of the error squares, i.e. the average squared difference between the estimated values and the
true values. It is a risk function corresponding to the expected value of the squared error loss.
It is always non-negative, and values closer to zero are better. The MSE is the second moment of
the error (about the origin) and thus incorporates both the variance of the estimator and its
bias.
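For evaluating and comparing the two models, a hedged sketch using scikit-learn's metrics is given below; y_pred and rf_pred are the test-set predictions from the linear regression and random forest sketches above.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

lr_mse = mean_squared_error(y_test, y_pred)
rf_mse = mean_squared_error(y_test, rf_pred)
print("Linear regression RMSE:", np.sqrt(lr_mse), " R2:", r2_score(y_test, y_pred))
print("Random forest RMSE:", np.sqrt(rf_mse), " R2:", r2_score(y_test, rf_pred))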
Conclusion:
In this way, we have explored the concept of correlation and implemented linear regression and
random forest regression models.
Assignment Questions:
—--------------------------------------------------------------------------------------
Group B
Assignment No : 8
—---------------------------------------------------------------------------------------
Title of the Assignment: Classify the email using the binary classification method.
Email Spam detection has two states:
a) Normal State – Not Spam,
b) Abnormal State – Spam.
Use K-Nearest Neighbors and Support Vector Machine for classification. Analyze
their performance.
Dataset Description: The csv file contains 5172 rows, one row for each email. There are 3002
columns. The first column indicates the email name; the name has been set with numbers and not
the recipients' names to protect privacy. The last column has the labels for prediction: 1 for
spam, 0 for not spam. The remaining 3000 columns are the 3000 most common words across all the
emails, after excluding non-alphabetical characters/words. For each row, the count of each word
(column) in that email (row) is stored in the respective cell. Thus, information regarding all
5172 emails is stored in a compact dataframe rather than as separate text files.
Link: https://www.kaggle.com/datasets/balaka18/email-spam-classification-
Students should be able to classify emails using binary classification and implement an email
spam detection technique using the K-Nearest Neighbors and Support Vector Machine algorithms.
Prerequisite:
1. Basic knowledge of Python
2. Concept of K-Nearest Neighbors and Support Vector Machine for classification.
1. Data Preprocessing
2. Binary Classification
3. K-Nearest Neighbours
4. Support Vector Machine
5. Train, Test and Split Procedure
Data Preprocessing:
Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, we do not always come across clean and formatted data,
and while doing any operation with data, it is mandatory to clean it and put it in a formatted
way. For this, we use the data preprocessing task.
Real-world data generally contains noise and missing values, and may be in an unusable format
which cannot be directly used for machine learning models. Data preprocessing is a required task
for cleaning the data and making it suitable for a machine learning model, which also increases
the accuracy and efficiency of the model.
●Importing libraries
●Importing datasets
●Feature scaling
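A hedged end-to-end sketch for this assignment is given below; the file name emails.csv and the column names "Email No." and "Prediction" follow the Kaggle description above but should be checked against the actual download.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

df = pd.read_csv("emails.csv")                       # assumed file name
X = df.drop(columns=["Email No.", "Prediction"])     # 3000 word-count columns
y = df["Prediction"]                                 # 1 = spam, 0 = not spam

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
svm = SVC(kernel="linear").fit(X_train, y_train)
print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))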
—--------------------------------------------------------------------------------------
Group B
Assignment No : 9
—---------------------------------------------------------------------------------------
Dataset Description: The case study is from an open-source dataset from Kaggle. The dataset
contains 10,000 sample points with 14 distinct features such as Customer Id, Credit Score,
Geography, Gender, Age, Tenure, Balance, etc.
Prerequisite:
1. Basic knowledge of Python
2. Concept of Confusion Matrix
The term "Arti cial Neural Network" is derived from Biological neural networks that develop the
structure of a human brain. Similar to the human brain that has neurons interconnected to one another,
arti cial neural networks also have neurons that are interconnected to one another in various layers of the
networks. These neurons are known as nodes.
The given gure illustrates the typical diagram of Biological Neural Network.
Dendrites from Biological Neural Network represent inputs in Arti cial Neural Networks, cellnucleus
represents Nodes, synapse represents Weights, and Axon represents Output.
Relationship between a biological neural network and an artificial neural network:
Dendrites → Inputs
Cell nucleus → Nodes
Synapse → Weights
Axon → Output
An Artificial Neural Network is a model in the field of Artificial Intelligence that attempts to
mimic the network of neurons that makes up the human brain, so that computers have an option to
understand things and make decisions in a human-like manner. The artificial neural network is
designed by programming computers to behave simply like interconnected brain cells.
We can understand the artificial neural network with an example. Consider a digital logic gate
that takes an input and gives an output, such as an "OR" gate, which takes two inputs. If one or
both inputs are "On," we get "On" as output; if both inputs are "Off," we get "Off" as output.
Here the output depends only on the input. Our brain does not perform the same task: the
output-to-input relationship keeps changing because the neurons in our brain are "learning."
To understand the architecture of an artificial neural network, we have to understand what a
neural network consists of. A neural network consists of a large number of artificial neurons,
termed units, arranged in a sequence of layers. Let us look at the various types of layers
available in an artificial neural network.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the programmer.
Hidden Layer:
The hidden layer sits between the input and output layers and performs the intermediate
calculations on the weighted inputs to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally results
in the output that is conveyed using this layer.
The artificial neural network takes the inputs, computes the weighted sum of the inputs, and adds
a bias. This computation is represented in the form of a transfer function.
The weighted total is then passed as an input to an activation function to produce the output.
Activation functions choose whether a node should fire or not; only the nodes that fire make it
to the output layer. There are distinct activation functions available that can be applied
depending on the sort of task we are performing.
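The weighted-sum-plus-activation idea described above can be illustrated with a few lines of NumPy; the input, weight, and bias values here are arbitrary examples.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, 0.3, 0.2])      # arbitrary example inputs
weights = np.array([0.4, 0.7, 0.2])     # one weight per input
bias = 0.1

z = np.dot(inputs, weights) + bias      # weighted sum plus bias (transfer function)
output = sigmoid(z)                     # activation decides the node's output
print(z, output)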
Keras:
Keras is an open-source, high-level neural network library written in Python that is capable of
running on Theano, TensorFlow, or CNTK. It was developed by a Google engineer, Francois Chollet.
It is made user-friendly, extensible, and modular to facilitate faster experimentation with deep
neural networks. It supports not only Convolutional Networks and Recurrent Networks individually
but also their combination.
Keras cannot handle low-level computations itself, so it makes use of a backend library to
perform them. The backend library acts as a wrapper around the low-level API, which lets Keras
run on TensorFlow, CNTK, or Theano.
Initially, it had over 4,800 contributors at its launch, and the community has since grown to
around 250,000 developers, roughly doubling every year. Big companies like Microsoft, Google,
NVIDIA, and Amazon have actively contributed to the development of Keras. It has amazing industry
interaction and is used in the development of popular firms like Netflix, Uber, Google, Expedia,
etc.
TensorFlow:
TensorFlow is a Google product and one of the most famous deep learning tools, widely used in
research on machine learning and deep neural networks. It was released on 9th November 2015 under
the Apache License 2.0. It is built in such a way that it can easily run on multiple CPUs and
GPUs as well as on mobile operating systems. It provides wrappers in several languages such as
Java, C++, and Python.
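As a hedged illustration for the churn task in this assignment, a small Keras model might be defined as follows; the input dimension of 11 is an assumption about the number of preprocessed features and must match your actual data.

from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected network for binary churn classification
model = keras.Sequential([
    layers.Dense(6, activation="relu", input_shape=(11,)),  # 11 = assumed number of features
    layers.Dense(6, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                  # probability of leaving
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=100, batch_size=32)    # after preprocessing and scaling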
Normalization:
Normalization is a scaling technique in Machine Learning applied during data preparation to
change the values of numeric columns in the dataset so that they use a common scale. It is not
necessary for all datasets in a model; it is required only when the features of the machine
learning model have different ranges. The Min-Max formula is:
Xn = (X - Xmin) / (Xmax - Xmin)
Where Xn is the normalized value, and Xmax and Xmin are the maximum and minimum values of the
feature.
Example: Let's assume we have a dataset having the maximum and minimum values of a feature as
mentioned above. To normalize the feature, values are shifted and rescaled so that their range
varies between 0 and 1. This technique is also known as Min-Max scaling. In this scaling
technique, the feature values change as follows:
Case 1 - If the value of X is the minimum, the numerator will be 0; hence the normalized value
will also be 0.
Case 2 - If the value of X is the maximum, then the numerator is equal to the denominator; hence
the normalized value will be 1.
Case 3 - On the other hand, if the value of X is neither the maximum nor the minimum, then the
normalized value will be between 0 and 1.
Hence, normalization can be defined as a scaling method where values are shifted and rescaled so
that they stay in the range between 0 and 1; in other words, it can be referred to as the Min-Max
scaling technique.
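A minimal scikit-learn sketch of Min-Max scaling is given below, assuming X_train and X_test are feature matrices that have already been split.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                           # default feature_range is (0, 1)
X_train_scaled = scaler.fit_transform(X_train)    # learn min/max on training data only
X_test_scaled = scaler.transform(X_test)          # reuse the same min/max on test data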
Although there are many feature scaling techniques in Machine Learning, a few of them are most
frequently used. These are as follows:
● Standardization scaling:
Standardization scaling is also known as Z-score normalization, in which values are centered
around the mean with a unit standard deviation; this means the mean of the attribute becomes zero
and the resultant distribution has a unit standard deviation. Mathematically, we can calculate
the standardization by subtracting the mean from the feature value and dividing it by the
standard deviation:
Xnew = (X - µ) / σ
Here, µ represents the mean of the feature values and σ represents the standard deviation of the
feature values. However, unlike the Min-Max scaling technique, feature values are not restricted
to a specific range in the standardization technique.
This technique is helpful for machine learning algorithms that use distance measures, such as
KNN, K-means clustering, and Principal Component Analysis. Further, it is also important when the
model is built on the assumption that the data is normally distributed.
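A minimal scikit-learn sketch of standardization, again assuming X_train and X_test are already split feature matrices:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                     # z = (x - mean) / std
X_train_std = scaler.fit_transform(X_train)   # mean and std are computed from training data
X_test_std = scaler.transform(X_test)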
Which is more suitable for our machine learning model, normalization or standardization? This is
probably a big point of confusion among data scientists as well as machine learning engineers.
Although both terms have almost the same goal, the choice between normalization and
standardization depends on your problem and the algorithm you are using in your models.
1. Normalization is a transformation technique that helps to improve the performance as well as
the accuracy of your model. Normalization of a machine learning model is useful when you don't
know the feature distribution exactly, in other words, when the feature distribution of the data
does not follow a Gaussian (bell curve) distribution. Normalization has a bounding range, so if
you have outliers in your data, they will be affected by normalization. Further, it is also
useful for scale-sensitive techniques such as KNN and artificial neural networks, since you can't
make assumptions about the distribution of the data.
2. Standardization in a machine learning model is useful when you know the feature distribution
of the data exactly, in other words, when your data follows a Gaussian distribution (although
this does not have to be strictly true). Unlike normalization, standardization does not have a
bounding range, so if you have outliers in your data, they will not be affected by
standardization. Further, it is also useful for techniques such as linear regression, logistic
regression, and linear discriminant analysis.
Example: Consider a dataset having two attributes, age and salary, where age ranges from 0 to 80
years old and income varies from 0 to 75,000 dollars or more; income values are roughly 1,000
times larger than age values. As a result, the ranges of these two attributes are very different
from one another.
Because of its bigger values, the income attribute will naturally influence the result more when
we undertake further analysis, such as multivariate linear regression. However, this does not
necessarily imply that it is a better predictor. As a result, we normalize the data so that all
of the variables are on the same scale.
Further, normalization is also helpful in the prediction of credit risk scores, where it is
applied to all numeric data except the class column. It uses the tanh transformation technique,
which converts all numeric features into values in the range 0 to 1.
Confusion Matrix:
The confusion matrix is a matrix used to determine the performance of classification models for a
given set of test data. It can only be determined if the true values for the test data are known.
The matrix itself is easy to understand, but the related terminology may be confusing. Since it
shows the errors in the model's performance in the form of a matrix, it is also known as an error
matrix. Some features of the confusion matrix are given below:
● For 2 prediction classes, the matrix is a 2x2 table; for 3 classes, it is a 3x3 table, and so
on.
● Predicted values are the values predicted by the model, and actual values are the true values
for the given observations.
● True Negative: The model has predicted No, and the real or actual value was also No.
● True Positive: The model has predicted Yes, and the actual value was also Yes.
● False Negative: The model has predicted No, but the actual value was Yes; it is also called a
Type-II error.
● False Positive: The model has predicted Yes, but the actual value was No; it is also called a
Type-I error.
● The matrix tells us not only the errors made by the classifier but also the type of error, i.e.
whether it is a Type-I or a Type-II error.
● With the help of the confusion matrix, we can calculate different parameters for the model,
such as accuracy, precision, etc.
Suppose we are trying to create a model that can predict whether or not a person has a certain
disease. The confusion matrix for this is given as:
● The table is given for a two-class classifier, which has two predictions, "Yes" and "No." Here,
Yes means that the patient has the disease, and No means that the patient does not have the
disease.
● The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are correct
predictions and 11 are incorrect predictions.
● The model has given the prediction "Yes" 32 times and "No" 68 times, whereas the actual "Yes"
occurred 27 times and the actual "No" 73 times.
We can perform various calculations for the model, such as the model's accuracy, using this
matrix. These calculations are given below:
● Classification Accuracy: It is one of the important parameters for classification problems. It
defines how often the model predicts the correct output, calculated as the ratio of the number of
correct predictions made by the classifier to the total number of predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
● Misclassification rate: It is also termed the error rate, and it defines how often the model
gives wrong predictions, calculated as the ratio of the number of incorrect predictions to the
total number of predictions:
Error rate = (FP + FN) / (TP + TN + FP + FN)
● Precision: It is the number of correct positive outputs provided by the model, i.e. out of all
the instances predicted as positive, how many were actually positive:
Precision = TP / (TP + FP)
● Recall: It is defined as, out of all actual positive instances, how many our model predicted
correctly. The recall should be as high as possible:
Recall = TP / (TP + FN)
● F-measure: If two models have low precision and high recall or vice versa, it is difficult to
compare them. For this purpose we can use the F-score, which evaluates recall and precision at
the same time and is maximum when recall equals precision:
F-measure = (2 * Recall * Precision) / (Recall + Precision)
● Null Error rate: It defines how often our model would be incorrect if it always predicted the
majority class. These metrics can also be computed directly with scikit-learn, as shown in the
sketch after this list.
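A hedged sketch of computing these metrics with scikit-learn is given below; y_test and y_pred stand for the actual and predicted labels of whichever classifier is being evaluated.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# y_test = actual labels, y_pred = labels predicted by the classifier
print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))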
Conclusion:
In this way, we built a neural network-based classifier that can determine whether or not
customers will leave the bank in the next 6 months.
Assignment Questions:
1) What is Normalization?
2) What is Standardization?
3) Explain the Confusion Matrix.
—--------------------------------------------------------------------------------------
Group B
Assignment No : 10
—---------------------------------------------------------------------------------------
Dataset Description: We will try to build a machine learning model to accurately predict whether
or not the patients in the dataset have diabetes.
The dataset consists of several medical predictor variables and one target variable, Outcome.
Predictor variables include the number of pregnancies the patient has had, their BMI, insulin
level, age, and so on.
Link for Dataset: Diabetes predication system with KNN algorithm | Kaggle
Prerequisite:
1. Basic knowledge of Python
2. Concept of Confusion Matrix
3. Concept of roc_auc curve.
4. Concept of Random Forest and KNN algorithms
k-Nearest-Neighbors (k-NN) is a supervised machine learning model. Supervised learning is when a model
learns from data that is already labeled. A supervised learning model takes in a set of input objects and output
values. The model then trains on that data to learn how to map the inputs to the desired output so it can learn to
make predictions on unseen data.
k-NN models work by taking a data point and looking at the 'k' closest labeled data points. The
data point is then assigned the label of the majority of the 'k' closest points.
For example, if k = 5, and 3 of the points are 'green' and 2 are 'red', then the data point in
question would be labeled 'green', since 'green' is the majority (as shown in the above graph).
Scikit-learn is a machine learning library for Python. In this tutorial, we will build a k-NN
model using Scikit-learn to predict whether a patient has diabetes.
Reading in the training data
For our k-NN model, the first step is to read in the data we will use as input. For this example,
we are using the diabetes dataset. To start, we will use Pandas to read in the data. I will not
go into detail on Pandas, but it is a library you should become familiar with if you're looking
to dive further into data science and machine learning.
Next, let's see how much data we have. We will call the 'shape' function on our dataframe to see
how many rows and columns there are in our data. The rows indicate the number of patients and the
columns indicate the number of features (age, weight, etc.) in the dataset for each patient.
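The corresponding code is roughly as follows; the file name diabetes.csv is an assumption and should match your downloaded dataset.

import pandas as pd

# read the diabetes data into a dataframe (file name assumed)
df = pd.read_csv('diabetes.csv')
# rows = patients, columns = features plus the target
df.shape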
Output: (768, 9)
We can see that we have 768 rows of data (potential diabetes patients) and 9 columns (8
input features and 1 target output).
Now let's split up our dataset into inputs (X) and our target (y). Our input will be every column
except 'diabetes', because 'diabetes' is what we will be predicting. We will use pandas' 'drop'
function to drop the column 'diabetes' from our dataframe and store the result in our input
variable (X). We will insert the 'diabetes' column of our dataset into our target variable (y).
#separate input features and target values
X = df.drop('diabetes', axis=1).values
y = df['diabetes'].values
#view target values
y[0:5]
Now we will split the dataset into training data and testing data. The training data is the data
that the model will learn from. The testing data is the data we will use to see how well the
model performs on unseen data.
Scikit-learn has a function we can use called 'train_test_split' that makes it easy for us to
split our dataset.
'train_test_split' takes in 5 parameters. The first two parameters are the input and target data
we split up earlier. Next, we will set 'test_size' to 0.2. This means that 20% of all the data
will be used for testing, which leaves 80% of the data as training data for the model to learn
from. Setting 'random_state' to 1 ensures that we get the same split each time so we can
reproduce our results.
Setting 'stratify' to y makes our training split represent the proportion of each value in the y
variable. For example, if in our dataset 25% of patients have diabetes and 75% do not, setting
'stratify' to y will ensure that the random split has 25% of patients with diabetes and 75% of
patients without diabetes.
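A hedged version of that call, using the variable names from this walkthrough:

from sklearn.model_selection import train_test_split

# 80% training data, 20% test data, stratified on the target labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)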
First, we will create a new k-NN classifier and set 'n_neighbors' to 3. To recap, this means that
if at least 2 out of the 3 nearest points to a new data point are patients without diabetes, then
the new data point will be labeled as 'no diabetes', and vice versa. In other words, a new data
point is labeled by the majority of the 3 nearest points.
We have set 'n_neighbors' to 3 as a starting point. We will go into more detail below on how to
better select a value for 'n_neighbors' so that the model can improve its performance.
Next, we need to train the model. In order to train our new model, we will use the 'fit' function
and pass in our training data as parameters to fit our model to the training data.
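In code, this step looks roughly like the following sketch:

from sklearn.neighbors import KNeighborsClassifier

# create a k-NN classifier with 3 nearest neighbors and fit it to the training data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)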
Once the model is trained, we can use the 'predict' function on our model to make predictions on
our test data. As seen when inspecting 'y' earlier, 0 indicates that the patient does not have
diabetes and 1 indicates that the patient does have diabetes. To save space, we will only print
the first 5 predictions of our test set.
#show first 5 model predictions on the test data
knn.predict(X_test)[0:5]
We can see that the model predicted 'no diabetes' for the first 4 patients in the test set.
Now let's see how accurate our model is on the full test set. To do this, we will use the 'score'
function and pass in our test input and target data to see how well our model performs on data it
has not seen before.
Our model has an accuracy of approximately 66.88%. It's a good start, but we will see how it can
be improved below.
k-Fold Cross-Validation
Cross-validation is when the dataset is randomly split up into 'k' groups. One of the groups is
used as the test set and the rest are used as the training set. The model is trained on the
training set and scored on the test set. The process is then repeated until each group has been
used as the test set.
For example, for 5-fold cross-validation, the dataset would be split into 5 groups, and the model
would be trained and tested 5 separate times so each group gets a chance to be the test set. This
is better than using the holdout method, because the holdout score depends on how the data
happens to be split into train and test sets. Cross-validation gives the model an opportunity to
be tested on multiple splits, so we get a better idea of how the model will perform on unseen
data.
In order to train and test our model using cross-validation, we will use the 'cross_val_score'
function with a cross-validation value of 5. 'cross_val_score' takes in our k-NN model and our
data as parameters. It then splits our data into 5 groups and fits and scores our data 5 separate
times, recording the accuracy score in an array each time. We will store these scores in the
'cv_scores' variable.
To find the average of the 5 scores, we will use numpy's mean function, passing in 'cv_scores'.
Using cross-validation, our mean score is about 71.36%. This is a more accurate representation of
how our model will perform on unseen data than our earlier single test-set score.
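The corresponding code, reusing the variable names mentioned above, looks roughly like this:

from sklearn.model_selection import cross_val_score
import numpy as np

# train and score the model on 5 different train/test splits
knn_cv = KNeighborsClassifier(n_neighbors=3)
cv_scores = cross_val_score(knn_cv, X, y, cv=5)
print(cv_scores)
print('cv_scores mean: {}'.format(np.mean(cv_scores)))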
When we built our initial k-NN model, we set the parameter 'n_neighbors' to 3 as a starting
point, but what is actually the best value? Grid search is a way of systematically testing
different parameters for your model to improve accuracy. In our case, we will use GridSearchCV to
find the optimal value for 'n_neighbors'.
GridSearchCV works by training our model multiple times on a range of parameters that we specify.
That way, we can test our model with each parameter value and figure out which one performs best.
For our model, we will specify a range of values for 'n_neighbors' in order to see which value
works best. To do this, we will create a dictionary, setting 'n_neighbors' as the key and using
numpy to create an array of values from 1 to 24.
Our new model using grid search will take in a new k-NN classifier, our param_grid, and a
cross-validation value of 5 in order to find the optimal value for 'n_neighbors'.
from sklearn.model_selection import GridSearchCV
#create a new knn model
knn2 = KNeighborsClassifier()
#create a dictionary of all values we want to test for n_neighbors
param_grid = {'n_neighbors': np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
knn_gscv = GridSearchCV(knn2, param_grid, cv=5)
#fit model to data
knn_gscv.fit(X, y)
After training, we can check which of the values for 'n_neighbors' that we tested performed the
best, using the 'best_params_' attribute.
We can see that 14 is the optimal value for 'n_neighbors'. We can use the 'best_score_' attribute
to check the accuracy of our model when 'n_neighbors' is 14; 'best_score_' outputs the mean
cross-validated accuracy of the best parameter setting.
By using grid search to find the optimal parameter for our model, we have improved our model's
accuracy noticeably over the initial model.
Code :-
Conclusion:
In this way we built a K-Nearest Neighbors classifier that can predict whether or not a patient
in the dataset has diabetes.
—--------------------------------------------------------------------------------------
Group B
Assignment No : 11
—---------------------------------------------------------------------------------------
Prerequisite:
1. Knowledge of Python
2. Unsupervised learning
3. Clustering
4. Elbow method
Clustering:
In clustering, diverse and different types of data are subdivided into smaller groups of similar
items.
Uses of Clustering
Marketing:
Customers can be grouped into segments with similar purchasing behaviour so that marketing
campaigns can be targeted more effectively.
Real Estate:
Properties can be grouped by attributes such as location, size, and price to understand the
structure of the market.
Libraries and Bookstores:
Libraries and bookstores can use clustering to better manage the book database. With proper book
ordering, better operations can be implemented.
Document Analysis:
Documents with similar content can be grouped together, which makes searching and organizing
large collections of text easier.
K-Means Clustering
K-Means clustering is an unsupervised machine learning algorithm that divides the given data into
the given number of clusters. Here, 'K' is the given number of predefined clusters that need to
be created.
The algorithm takes raw unlabelled data as input and divides the dataset into clusters, and the
process is repeated until the best clusters are found.
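A hedged setup for this assignment is shown below; the file name Mall_Customers.csv and the column names are assumptions based on the common Kaggle customer-segmentation dataset and should be checked against your copy.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

# read the customer data (file name assumed)
data = pd.read_csv("Mall_Customers.csv")
# two features used first: annual income and spending score (column names assumed)
X = data[["Annual Income (k$)", "Spending Score (1-100)"]]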
%matplotlib inline
The data is read into a dataframe. The data has 200 entries, that is, data from 200 customers.
data.head()
data.corr()
Age Distribution:
#Distribution of age
plt.figure(figsize=(10, 6))
sns.set(style='whitegrid')
sns.distplot(data['Age'])
plt.title('Distribution of Age', fontsize=20)
plt.xlabel('Range of Age')
plt.ylabel('Count')
Gender Analysis:
genders = data.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10, 4))
sns.barplot(x=genders.index, y=genders.values)
plt.show()
First, we work with two features only, annual income and spending score.
#The input data
X.head()
Now we calculate the Within-Cluster Sum of Squares (WCSS) for different values of k. Next, we
choose the k at which the decrease in WCSS starts to level off. This value of K gives us the best
number of clusters to make from the raw data.
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i)
    km.fit(X)
    wcss.append(km.inertia_)
#The elbow curve
plt.figure(figsize=(12, 6))
plt.plot(range(1, 11), wcss)
This is known as the elbow graph, with the x-axis being the number of clusters. The number of
clusters is taken at the elbow joint point, which is the point where the value of WCSS suddenly
stops decreasing rapidly, so making further clusters is no longer worthwhile. Here in the graph,
after 5 the drop is minimal, so we take 5 to be the number of clusters.
#Taking 5 clusters
km1 = KMeans(n_clusters=5)
#Fitting the input data
km1.fit(X)
#predicting the labels of the input data
y = km1.predict(X)
#adding the labels to a column named label
df1["label"] = y
#The new dataframe with the clustering done
df1.head()
We can clearly see that 5 different clusters have been formed from the data. The red cluster is
the customers with the least income and least spending score; similarly, the blue cluster is the
customers with the most income and most spending score.
Now we shall work with three features. Apart from the spending score and annual income of
customers, we shall also take in the age of the customers.
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init="k-means++")
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)
plt.figure(figsize=(12, 6))
plt.plot(range(1, 11), wcss, linewidth=2, color="red", marker="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1, 11, 1))
plt.ylabel("WCSS")
plt.show()
The data:
The output:
cust1 = df2[df2["label"] == 1]
print('Number of customer in 1st group=', len(cust1))
print('They are -', cust1["CustomerID"].values)
print(" ")
cust2 = df2[df2["label"] == 2]
print('Number of customer in 2nd group=', len(cust2))
print('They are -', cust2["CustomerID"].values)
print(" ")
cust3 = df2[df2["label"] == 0]
print('Number of customer in 3rd group=', len(cust3))
print('They are -', cust3["CustomerID"].values)
print(" ")
cust4 = df2[df2["label"] == 3]
——————————————–
They are - [ 47 51 55 56 57 60 67 72 77 78 80 82 84 86 90 93 94 97
99 102 105 108 113 118 119 120 122 123 127]
——————————————–
They are - [124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158
160 162 164 166 168 170 172 174 176 178]
——————————————–
——————————————–
Conclusion:
So, we used K-Means clustering to understand customer data. K-Means is a good clustering
algorithm. Almost all the clusters have similar density. It is also fast and efficient in terms of
computational cost.
Assignment No : 12
MINI PROJECT 2
# data processing
import pandas as pd
import numpy as np
# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style
# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
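The train_df and test_df dataframes used throughout this mini project are assumed to come from the standard Kaggle Titanic files; a hedged loading step would be:

test_df = pd.read_csv("test.csv")     # assumed file name
train_df = pd.read_csv("train.csv")   # assumed file name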
Data Exploration/Analysis
train_df.info()
From the table above, we can note a few things. First of all, we need to convert a lot of
features into numeric ones so that the machine learning algorithms can process them, and we need
to bring the features onto roughly the same scale. We can also spot some more features that
contain missing values (NaN = not a number), which we need to deal with.
You can see that men have a high probability of survival when
they are between 18 and 30 years old, which is also a little bit
true for women but not fully. For women the survival chances
are higher between 14 and 40.
For men the probability of survival is very low between the age
of 5 and 18, but that isn't true for women. Another thing to
note is that infants also have a little bit higher probability of
survival.
not survive.
axes = sns.factorplot('relatives', 'Survived', data=train_df, aspect=2.5)
Here we can see that you had a high probability of survival with 1 to 3 relatives, but a lower
one if you had fewer than 1 or more than 3 (except for some cases with 6 relatives).
Data Preprocessing
Missing Data:
As a reminder, we have to deal with Cabin (687) and Embarked (2).
Cabin:
Age:
Now we can tackle the issue of the Age feature's missing values. I will create an array that
contains random numbers, which are computed based on the mean age value with regard to the
standard deviation and is_null.
data = [train_df, test_df]
for dataset in data:
    mean = train_df["Age"].mean()
    std = test_df["Age"].std()
    is_null = dataset["Age"].isnull().sum()
    # compute random numbers between the mean, std and is_null
    rand_age = np.random.randint(mean - std, mean + std, size=is_null)
    # fill NaN values in Age column with random values generated
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    dataset["Age"] = age_slice
    dataset["Age"] = train_df["Age"].astype(int)
train_df["Age"].isnull().sum()
Embarked:
common_value = 'S'
data = [train_df, test_df]
for dataset in data:
    # fill the two missing Embarked values with the most common port
    dataset['Embarked'] = dataset['Embarked'].fillna(common_value)
Converting Features:
train_df.info()
Above you can see that 'Fare' is a float and we have to deal with 4 categorical features: Name,
Sex, Ticket and Embarked. Let's investigate and transform one after another.
Fare:
Converting “Fare” from float to int64, using the “astype()” function
pandas provides:
data = [train_df, test_df]
for dataset in data:
    # replace missing fares with 0 before converting to integer
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)
Name:
We will use the Name feature to extract the Titles from the
Name, so that we can build a new feature out of that.
data = [train_df, test_df]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in data:
    # extract the title from the Name column
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    # map titles to numbers; titles not in the dictionary become 0
    dataset['Title'] = dataset['Title'].map(titles)
    dataset['Title'] = dataset['Title'].fillna(0)
Sex:
Convert the 'Sex' feature into numeric values.
genders = {"male": 0, "female": 1}
data = [train_df, test_df]
for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)
Ticket:
train_df['Ticket'].describe()
Since the Ticket attribute has 681 unique tickets, it will be a bit tricky to convert them into
useful categories, so we will drop it from the dataset.
train_df = train_df.drop(['Ticket'], axis=1)
test_df = test_df.drop(['Ticket'], axis=1)
Embarked:
Convert the 'Embarked' feature into numeric values.
ports = {"S": 0, "C": 1, "Q": 2}
data = [train_df, test_df]
for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].map(ports)
Creating Categories:
Age:
Now we need to convert the 'Age' feature. First we will convert it from float into integer. Then
we will create the new 'AgeGroup' variable by categorizing every age into a group. Note that it
is important to pay attention to how you form these groups, since you don't want, for example,
80% of your data to fall into group 1.
data = [train_df, test_df]
for dataset in data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 66), 'Age'] = 6
Fare:
For the 'Fare' feature, we need to do the same as with the 'Age' feature. But it isn't that easy,
because if we cut the range of the fare values into a few equally sized categories, 80% of the
values would fall into the first category. Fortunately, we can use the pandas 'qcut()' function
to see how we can form the categories.
train_df.head(10)
I will add two new features to the dataset, that I compute out of
other features.
Stochastic Gradient Descent (SGD):
sgd.score(X_train, Y_train)
Random Forest:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
Logistic Regression:
logreg = LogisticRegression().fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
K Nearest Neighbor:
# KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
Perceptron:
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
Linear Support Vector Machine:
linear_svc = LinearSVC().fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
This looks much more realistic than before. Our model has an average accuracy of 82% with a
standard deviation of 4%. The standard deviation shows us how precise the estimates are. This
means that in our case the accuracy of our model can differ by +/- 4%.
Random Forest
One big advantage of random forest is, that it can be used for
both classification and regression problems, which form the
majority of current machine learning systems. With a few
exceptions a random-forest classifier has all the
hyperparameters of a decision-tree classifier and also all the
hyperparameters of a bagging classifier, to control the
ensemble itself.
Below you can see what a random forest with two trees would look like:
Feature Importance
importances.plot.bar()
Conclusion:
train_df = train_df.drop("Parch", axis=1)
test_df = test_df.drop("Parch", axis=1)
random_forest.score(X_train, Y_train)
92.82%
Hyperparameter Tuning
Below you can see the code of the hyperparameter tuning for the parameters criterion,
min_samples_leaf, min_samples_split and n_estimators.
I put this code into a markdown cell and not into a code cell, because it takes a long time to
run. Directly underneath it, I put a screenshot of the grid search's output.
param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_leaf": [1, 5, 10, 25, 50, 70],
              "min_samples_split": [2, 4, 10, 12, 16, 18, 25, 35],
              "n_estimators": [100, 400, 700, 1000, 1500]}
from sklearn.model_selection import GridSearchCV, cross_val_score
rf = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True,
                            random_state=1, n_jobs=-1)
clf = GridSearchCV(estimator=rf, param_grid=param_grid, n_jobs=-1)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
Further Evaluation
Confusion Matrix:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
confusion_matrix(Y_train, predictions)
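The precision and recall figures quoted below can be computed with scikit-learn's metric functions, for example:

from sklearn.metrics import precision_score, recall_score

print("Precision:", precision_score(Y_train, predictions))
print("Recall:", recall_score(Y_train, predictions))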
Precision: 0.801948051948
Recall: 0.722222222222
F-Score
You can combine precision and recall into one score, which is called the F-score. The F-score is
computed with the harmonic mean of precision and recall. Note that it assigns much more weight to
low values. As a result, the classifier will only get a high F-score if both recall and precision
are high.
from sklearn.metrics import f1_score
f1_score(Y_train, predictions)
0.7599999999999
We will plot the precision and recall with the threshold using matplotlib:
plt.figure(figsize=(14, 7))
plot_precision_and_recall(precision, recall, threshold)
plt.show()
Above you can clearly see that the recall falls off rapidly at a precision of around 85%. Because
of that, you may want to select the precision/recall tradeoff before that point, maybe at around
75%.
You are now able to choose a threshold, that gives you the best
precision/recall tradeoff for your current machine learning
problem. If you want for example a precision of 80%, you can
easily look at the plots and see that you would need a threshold
of around 0.4. Then you could train a model with exactly that
threshold and would get the desired accuracy.
Another way is to plot the precision and recall against each other:
plt.figure(figsize=(14, 7))
plot_precision_vs_recall(precision, recall)
plt.show()
plt.figure(figsize=(14, 7))
plot_roc_curve(false_positive_rate, true_positive_rate)
plt.show()
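The ROC-AUC score reported below can be obtained directly from scikit-learn; y_scores here stands for the predicted scores/probabilities for the positive class used to draw the ROC curve above (an assumption, since that step is not shown here).

from sklearn.metrics import roc_auc_score

# y_scores = predicted scores/probabilities for the positive class
r_a_score = roc_auc_score(Y_train, y_scores)
print("ROC-AUC-Score:", r_a_score)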
ROC_AUC_SCORE: 0.945067587