Introduction To K-Nearest Neighbors: Simplified (With Implementation in Python)
Table of Contents
When do we use KNN algorithm?
How does the KNN algorithm work?
How do we choose the factor K?
Breaking it Down – Pseudo Code of KNN
Implementation in Python from scratch
Comparing our model with scikit-learn
2. Calculation time
3. Predictive Power
You intend to find out the class of the blue star (BS). BS can either be RC or GS and
nothing else. The "K" in the KNN algorithm is the number of nearest neighbors we wish to take a
vote from. Let's say K = 3. Hence, we will now make a circle with BS as the center, just big
enough to enclose only three data points on the plane. Refer to the following diagram for more
details:
The three closest points to BS are all RC. Hence, with a good confidence level, we can say
that BS should belong to the class RC. Here, the choice became very obvious as all
three votes from the closest neighbors went to RC. The choice of the parameter K is very
crucial in this algorithm. Next, we will understand the factors to be considered in
concluding the best K.
As you can see, the error rate at K=1 is always zero for the training sample. This is
because the closest point to any training data point is itself. Hence, the prediction is
always accurate with K=1. If the validation error curve had been similar, our choice
of K would have been 1. Following is the validation error curve with varying values of K:
This makes the story clearer. At K=1, we were overfitting the boundaries. Hence, the
error rate initially decreases and reaches a minimum. After the minimum point, it
increases with increasing K. To get the optimal value of K, you can segregate the training
and validation sets from the initial dataset. Now plot the validation error curve to get the
optimal value of K. This value of K should be used for all predictions.
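A validation error curve like the one above can be produced with a few lines of scikit-learn. The sketch below is only an illustration, not the article's own code: it assumes a feature matrix X and a label vector y (hypothetical names) and a simple hold-out split.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X (features) and y (labels) are assumed to exist already
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

k_values = range(1, 26)
val_errors = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    # Validation error = 1 - accuracy on the held-out set
    val_errors.append(1 - model.score(X_val, y_val))

plt.plot(k_values, val_errors, marker='o')
plt.xlabel('K')
plt.ylabel('Validation error')
plt.show()

The K at which this curve bottoms out is the one you would carry forward for all predictions.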
# Importing libraries
import pandas as pd
import numpy as np
import math
import operator
# Importing data
data = pd.read_csv("iris.csv")
data.head()
# Defining a function which calculates euclidean distance between two data points
def euclideanDistance(data1, data2, length):
    distance = 0
    for x in range(length):
        distance += np.square(data1[x] - data2[x])
    return np.sqrt(distance)

# Defining our KNN model
def knn(trainingSet, testInstance, k):
    distances = {}
    length = testInstance.shape[1]

    # Calculating euclidean distance between each row of training data and test data
    for x in range(len(trainingSet)):
        #### Start of STEP 3.1
        dist = euclideanDistance(testInstance, trainingSet.iloc[x], length)
        distances[x] = dist[0]
        #### End of STEP 3.1

    # Sorting the distances in ascending order
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))

    # Extracting the top k neighbors
    neighbors = []
    for x in range(k):
        neighbors.append(sorted_d[x][0])

    # Calculating the most frequent class among the neighbors
    classVotes = {}
    for x in range(len(neighbors)):
        response = trainingSet.iloc[neighbors[x]][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1

    # Sorting the votes in descending order and returning the winning class
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1),
                         reverse=True)
    return(sortedVotes[0][0], neighbors)

# Creating a dummy test instance
testSet = [[7.2, 3.6, 5.1, 2.5]]
test = pd.DataFrame(testSet)

# Setting number of neighbors = 1 and running the model
k = 1
result, neigh = knn(data, test, k)

# Predicted class
print(result)
-> Iris-virginica
# Nearest neighbor
print(neigh)
-> [141]
Now we will try to alter the k values, and see how the prediction changes.
k = 3
result, neigh = knn(data, test, k)

# Predicted class
print(result)
-> Iris-virginica

# 3 nearest neighbors
print(neigh)
-> [141, 139, 120]
k = 5
result, neigh = knn(data, test, k)

# Predicted class
print(result)

# 5 nearest neighbors
print(neigh)
# scikit-learn implementation
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(data.iloc[:,0:4], data['Name'])
# Predicted class
print(neigh.predict(test))
-> ['Iris-virginica']
# 3 nearest neighbors
print(neigh.kneighbors(test)[1])
We can see that both the models predicted the same class (‘Iris-virginica’) and the same
nearest neighbors ( [141 139 120] ). Hence we can conclude that our model runs as
expected.
End Notes
The KNN algorithm is one of the simplest classification algorithms. Even with such simplicity, it
can give highly competitive results. The KNN algorithm can also be used for regression
problems. The only difference from the discussed methodology would be using the average of the
nearest neighbors rather than voting from the nearest neighbors. KNN can be coded in a
single line in R. I am yet to explore how we can use the KNN algorithm in SAS.
Did you find the article useful? Have you used any other machine learning tool recently?
Do you plan to use KNN in any of your business problems? If yes, share with us how
you plan to go about it.
What is Principal Component Analysis ?
In simple words, principal component analysis is a method of extracting important
variables (in the form of components) from a large set of variables available in a data set. It
extracts a low-dimensional set of features from a high-dimensional data set with the aim
of capturing as much information as possible. With fewer variables, visualization also
becomes much more meaningful. PCA is most useful when dealing with data of 3 or more
dimensions.
Let's say we have a data set of dimension 300 (n) × 50 (p). n represents the number of
observations and p represents the number of predictors. Since we have a large p = 50,
there can be p(p-1)/2 scatter plots, i.e. more than 1000 plots, to analyze the
variable relationships. Wouldn't it be a tedious job to perform exploratory analysis on this
data?
In this case, it would be a lucid approach to select a subset of p (p << 50) predictors
which captures as much information as possible, followed by plotting the observations in the
resultant low-dimensional space.
The image below shows the transformation of high dimensional data (3 dimensions) to
low dimensional data (2 dimensions) using PCA. Not to forget, each resultant dimension
is a linear combination of the p features.
Source: nlpca
For instance, the first principal component can be written as Z1 = Φ11·X1 + Φ21·X2 + ... + Φp1·Xp,
where X1, ..., Xp are the (normalized) predictors and Φ11, ..., Φp1 are the loadings of the first
principal component, constrained so that their sum of squares equals 1.
Therefore, the first principal component results in a line which is closest to the data, i.e. it
minimizes the sum of squared distances between the data points and the line.
If the two components are uncorrelated, their directions should be orthogonal (image
below). This image is based on simulated data with 2 predictors. Notice the directions of
the components; as expected, they are orthogonal. This suggests the correlation between
these components is zero.
All succeeding principal components follow a similar concept, i.e. they capture the remaining
variation without being correlated with the previous components. In general, for n ×
p dimensional data, min(n-1, p) principal components can be constructed.
The directions of these components are identified in an unsupervised way i.e. the
response variable(Y) is not used to determine the component direction. Therefore, it
is an unsupervised approach.
Note: Partial least squares (PLS) is a supervised alternative to PCA. PLS assigns higher
weight to variables which are strongly related to the response variable when determining the
principal components.
Performing PCA on un-normalized variables will lead to insanely large loadings for
variables with high variance. In turn, this will lead to dependence of a principal
component on the variable with high variance. This is undesirable.
As shown in the image below, PCA was run on a data set twice (with unscaled and scaled
predictors). This data set has ~40 variables. You can see that the first principal component is
dominated by the variable Item_MRP, and the second principal component is dominated by the
variable Item_Weight. This domination prevails due to the high variance associated
with those variables. When the variables are scaled, we get a much better representation of the
variables in 2D space.
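A minimal Python sketch of why scaling matters, assuming a numeric DataFrame df (a hypothetical name), is to compare the first loading vector with and without standardization:

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# df is assumed to be a DataFrame of numeric predictors (hypothetical name)
pca_raw = PCA(n_components=2).fit(df.values)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(df.values))

# Without scaling, the first loading vector tends to be dominated by the
# highest-variance column; with scaling, the loadings are more balanced.
print(pd.Series(pca_raw.components_[0], index=df.columns))
print(pd.Series(pca_scaled.components_[0], index=df.columns))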
For this demonstration, I’ll be using the data set from Big Mart Prediction Challenge III.
Remember, PCA can be applied only on numerical data. Therefore, if the data has
categorical variables they must be converted to numerical. Also, make sure you have
done the basic data cleaning prior to implementing this technique. Let’s quickly finish
with initial data loading and cleaning steps:
#directory path
> path <- ".../Data/Big_Mart_Sales"
#add a column
> test$Item_Outlet_Sales <- 1
Till here, we’ve imputed missing values. Now we are left with removing the dependent
(response) variable and other identifier variables( if any). As we said above, we are
practicing an unsupervised learning technique, hence response variable must be
removed.
Let’s check the available variables ( a.k.a predictors) in the data set.
Since PCA works on numeric variables, let’s see if we have any variable other than
numeric.
Sadly, 6 out of 9 variables are categorical in nature. We have some additional work to do
now. We’ll convert these categorical variables into numeric using one hot encoding.
#load library
> library(dummies)
And, we now have all the numerical values. Let’s divide the data into test and train.
The base R function prcomp() is used to perform PCA. By default, it centers the variables
to have mean equal to zero. With the parameter scale. = T, we normalize the variables to
have standard deviation equal to 1.
2. The rotation measure provides the principal component loading. Each column of
rotation matrix contains the principal component loading vector. This is the most
important measure we should be interested in.
> prin_comp$rotation
This returns 44 principal component loadings. Is that correct? Absolutely. In a data set,
the maximum number of principal component loadings is min(n-1, p). Let's
look at the first 4 principal components and the first 5 rows.
> prin_comp$rotation[1:5,1:4]
PC1 PC2 PC3 PC4
Item_Weight 0.0054429225 -0.001285666 0.011246194 0.011887106
Item_Fat_ContentLF -0.0021983314 0.003768557 -0.009790094 -0.016789483
Item_Fat_Contentlow fat -0.0019042710 0.001866905 -0.003066415 -0.018396143
Item_Fat_ContentLow Fat 0.0027936467 -0.002234328 0.028309811 0.056822747
Item_Fat_Contentreg 0.0002936319 0.001120931 0.009033254 -0.001026615
3. In order to compute the principal component score vectors, we don't need to multiply
the loadings with the data. Rather, the matrix x already contains the principal component score
vectors in an 8523 × 44 dimension.
> dim(prin_comp$x)
[1] 8523 44
> biplot(prin_comp, scale = 0)
The parameter scale = 0 ensures that arrows are scaled to represent the loadings. To
make inference from image above, focus on the extreme ends (top, bottom, left, right) of
this graph.
#compute standard deviation of each principal component
> std_dev <- prin_comp$sdev
#compute variance
> pr_var <- std_dev^2
We aim to find the components which explain the maximum variance. This is because
we want to retain as much information as possible using these components. So, the higher
the explained variance, the higher the information contained in those components.
This shows that the first principal component explains 10.3% of the variance, the second
component explains 7.3%, the third 6.2%, and so on. So, how do
we decide how many components we should select for the modeling stage?
The answer to this question is provided by a scree plot. A scree plot is used to assess the
components or factors which explain most of the variability in the data. It represents
values in descending order.
#proportion of variance explained
> prop_varex <- pr_var/sum(pr_var)

#scree plot
> plot(prop_varex, xlab = "Principal Component",
       ylab = "Proportion of Variance Explained",
       type = "b")
The plot above shows that ~30 components explain around 98.4% of the variance in the data
set. In other words, using PCA we have reduced 44 predictors to 30 without
compromising on explained variance. This is the power of PCA. Let's do a confirmation
check by plotting a cumulative variance plot. This will give us a clear picture of the number
of components.
1. We should not combine the train and test set to obtain PCA components of whole
data at once. Because, this would violate the entire assumption of
generalization since test data would get ‘leaked’ into the training set. In other
words, the test data set would no longer remain ‘unseen’. Eventually, this will
hammer down the generalization capability of the model.
2. We should not perform PCA on test and train data sets separately. Because, the
resultant vectors from train and test PCAs will have different directions ( due to
unequal variance). Due to this, we’ll end up comparing data registered on
different axes. Therefore, the resulting vectors from train and test data should
have same axes.
We should apply exactly the same transformation to the test set as we did to the training set,
including the centering and scaling.
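A minimal scikit-learn sketch of this idea (assuming numeric DataFrames train_X and test_X, both hypothetical names; not the exact R workflow above):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Learn the centering/scaling and the PCA rotation on the training data only
scaler = StandardScaler().fit(train_X)
pca = PCA(n_components=30).fit(scaler.transform(train_X))

# Apply the *same* scaler and rotation to both sets, so they live on the same axes
train_pcs = pca.transform(scaler.transform(train_X))
test_pcs = pca.transform(scaler.transform(test_X))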
That’s the complete modeling process after PCA extraction. I’m sure you wouldn’t be
happy with your leaderboard rank after you upload the solution. Try using random forest!
For Python Users: To implement PCA in Python, simply import PCA from the sklearn
library. The interpretation remains the same as explained for R users above. Of course, the
result is the same as derived after using R. The data set used for Python is a cleaned
version where missing values have been imputed, and categorical variables have been
converted into numeric. The modeling process remains the same, as explained for R users
above.
import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
%matplotlib inline

# Load the cleaned data set (file name assumed) and scale the values
data = pd.read_csv("Big_Mart_PCA.csv")
X = scale(data.values)

pca = PCA(n_components=44)
pca.fit(X)

# Cumulative variance (in %) explained by the components
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
print var1
[ 10.37 17.68 23.92 29.7 34.7 39.28 43.67 46.53 49.27
51.92 54.48 57.04 59.59 62.1 64.59 67.08 69.55 72.
74.39 76.76 79.1 81.44 83.77 86.06 88.33 90.59 92.7
94.76 96.78 98.44 100.01 100.01 100.01 100.01 100.01 100.01
100.01 100.01 100.01 100.01 100.01 100.01 100.01 100.01]

plt.plot(var1)

# Keeping 30 components based on the plot above and computing the score vectors
pca = PCA(n_components=30)
X1 = pca.fit_transform(X)
print X1
Points to Remember
1. PCA is used to overcome features redundancy in a data set.
2. These features are low dimensional in nature.
3. These features a.k.a components are a resultant of normalized linear
combination of original predictor variables.
4. These components aim to capture as much information as possible with high
explained variance.
5. The first component has the highest variance followed by second, third and so
on.
6. The components must be uncorrelated (remember orthogonal direction ? ). See
above.
7. Normalizing data becomes extremely important when the predictors are
measured in different units.
8. PCA works best on data set having 3 or higher dimensions. Because, with higher
dimensions, it becomes increasingly difficult to make interpretations from the
resultant cloud of data.
9. PCA is applied on a data set with numeric variables.
10. PCA is a tool which helps to produce better visualizations of high dimensional
data.
End Notes
This brings me to the end of this tutorial. Without delving deep into mathematics, I’ve
tried to make you familiar with most important concepts required to use this technique.
It’s simple but needs special attention while deciding the number of components.
Practically, we should strive to retain only the first few k components.
The idea behind PCA is to construct a few principal components (k << p) which
satisfactorily explain most of the variability in the data, as well as the relationship with the
response variable.
Think of machine learning algorithms as an armory packed with axes, swords, blades,
bows, daggers, etc. You have various tools, but you ought to learn to use them at the right
time. As an analogy, think of 'Regression' as a sword capable of slicing and dicing
data efficiently, but incapable of dealing with highly complex data. On the
contrary, 'Support Vector Machines' is like a sharp knife: it works on smaller datasets,
but on them, it can be much stronger and more powerful in building models.
By now, I hope you've mastered Random Forest, Naive Bayes
Algorithm and Ensemble Modeling. If not, I'd suggest you take a few minutes and
read about them as well. In this article, I shall guide you through the basics to advanced
knowledge of a crucial machine learning algorithm, support vector machines.
Table of Contents
1. What is Support Vector Machine?
2. How does it work?
3. How to implement SVM in Python and R?
4. How to tune Parameters of SVM?
5. Pros and Cons associated with SVM
Support Vectors are simply the coordinates of individual observations. A Support Vector
Machine is a frontier which best segregates the two classes (hyper-plane/line).
You can look at the definition of support vectors and a few examples of how they work here.
How does it work?
Above, we got accustomed to the process of segregating the two classes with a hyper-
plane. Now the burning question is “How can we identify the right hyper-plane?”. Don’t
worry, it’s not as hard as you think!
Let’s understand:
In SVM, it is easy to have a linear hyper-plane between these two classes. But
another burning question which arises is, do we need to add this feature
manually to have a hyper-plane? No, SVM has a technique called
the kernel trick. Kernels are functions which take a low dimensional input space
and transform it into a higher dimensional space, i.e. they convert a non-separable
problem into a separable problem. They are mostly
useful in non-linear separation problems. Simply put, the kernel does some extremely
complex data transformations, then finds out the process to separate the data
based on the labels or outputs you've defined.
When we look at the hyper-plane in the original input space, it looks like a circle:
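As a minimal illustration of the kernel trick (a sketch, not the article's own code), an RBF-kernel SVC separates a circular class boundary that no linear hyper-plane in the original space could:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric classes: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_svc = SVC(kernel='linear').fit(X, y)
rbf_svc = SVC(kernel='rbf', gamma='scale').fit(X, y)

print("linear kernel accuracy:", linear_svc.score(X, y))  # roughly chance level
print("rbf kernel accuracy:", rbf_svc.score(X, y))        # close to 1.0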
Now, let’s look at the methods to apply SVM algorithm in a data science challenge.
#Import Library
from sklearn import svm
#Assumed you have, X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
#Create an SVM classification object
model = svm.SVC(kernel='linear', C=1, gamma=1)
#There are various options associated with it, like changing the kernel, gamma and C value. Will discuss more about it in the next section.
#Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)
The e1071 package in R is used to create Support Vector Machines with ease. It has
helper functions as well as code for the Naive Bayes Classifier. The creation of a support
vector machine in R and Python follow similar approaches, let’s take a look now at the
following code:
#Import Library
library(e1071)
# there are various options associated with SVM training; like changing the kernel, gamma and cost value

# create model
model <- svm(Target ~ Predictor1 + Predictor2 + Predictor3,
             data = Train, kernel = 'linear', gamma = 0.2, cost = 100)

#Predict Output
preds <- predict(model, Test)
table(preds)
I am going to discuss some important parameters having a higher impact on model
performance: "kernel", "gamma" and "C".
kernel: We have already discussed it. Here, we have various options available for the kernel,
like "linear", "rbf", "poly" and others (the default value is "rbf"). Here "rbf" and
"poly" are useful for a non-linear hyper-plane. Let's look at an example where we've used a
linear kernel on two features of the iris data set to classify their class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features
y = iris.target

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
svc = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# create a mesh to plot the decision regions in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min) / 100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.show()
I would suggest you go for a linear kernel if you have a large number of features (>1000),
because it is more likely that the data is linearly separable in high dimensional space.
Also, you can use RBF, but do not forget to cross-validate its parameters to avoid over-fitting.
gamma: Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The higher the value of gamma, the
more the model will try to exactly fit the training data set, which hurts generalization and
causes over-fitting.
We should always look at the cross validation score to find an effective combination of
these parameters and avoid over-fitting.
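A minimal sketch of such a search, assuming training arrays X and y (hypothetical names) and scikit-learn's GridSearchCV:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values for C and gamma; the grid is illustrative, not prescriptive
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}

search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)  # X, y assumed to be the training features and labels

print(search.best_params_)  # combination with the best cross-validation score
print(search.best_score_)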
In R, SVMs can be tuned in a similar fashion as they are in Python. Mentioned below are
the respective parameters for e1071 package:
Practice Problem
Find right additional feature to have a hyper-plane for segregating the classes in below
snapshot:
Answer with the variable name in the comments section below. I shall then reveal the
answer.
End Notes
Methods like decision trees, random forest, gradient boosting are being popularly used
in all kinds of data science problems. Hence, for every analyst (fresher also), it’s
important to learn these algorithms and use them for modeling.
This tutorial is meant to help beginners learn tree based modeling from scratch. After the
successful completion of this tutorial, one is expected to become proficient at using tree
based algorithms and build predictive models.
Table of Contents
1. What is a Decision Tree? How does it work?
2. Regression Trees vs Classification Trees
3. How does a tree decide where to split?
4. What are the key parameters of model building and how can we avoid
over-fitting in decision trees?
5. Are tree based models better than linear models?
6. Working with Decision Trees in R and Python
7. What are the ensemble methods of trees based model?
8. What is Bagging? How does it work?
9. What is Random Forest ? How does it work?
10. What is Boosting ? How does it work?
11. Which is more powerful: GBM or Xgboost?
12. Working with GBM in R and Python
13. Working with Xgboost in R and Python
14. Where to Practice ?
1. What is a Decision Tree ? How does it work ?
Decision tree is a type of supervised learning algorithm (having a pre-defined target
variable) that is mostly used in classification problems. It works for both categorical and
continuous input and output variables. In this technique, we split the population or
sample into two or more homogeneous sets (or sub-populations) based on most
significant splitter / differentiator in input variables.
Example:-
Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl),
Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, I
want to create a model to predict who will play cricket during leisure time. In this
problem, we need to segregate students who play cricket in their leisure time based on the
most significant input variable among all three.
This is where a decision tree helps: it will segregate the students based on all values of the
three variables and identify the variable which creates the best homogeneous sets of
students (which are heterogeneous to each other). In the snapshot below, you can see
that the variable Gender is able to identify the best homogeneous sets compared to the other
two variables.
As mentioned above, a decision tree identifies the most significant variable and the value of that
variable which gives the best homogeneous sets of the population. Now the question which arises is, how
does it identify the variable and the split? To do this, decision trees use various
algorithms, which we shall discuss in the following section.
Types of Decision Trees
The type of decision tree is based on the type of target variable we have. It can be of two
types:
1. Categorical Variable Decision Tree: the tree has a categorical target variable (for example, whether a customer will pay the renewal premium: yes or no).
2. Continuous Variable Decision Tree: the tree has a continuous target variable.
Example:- Let’s say we have a problem to predict whether a customer will pay his
renewal premium with an insurance company (yes/ no). Here we know that income of
customer is a significant variable but insurance company does not have income details
for all customers. Now, as we know this is an important variable, then we can build a
decision tree to predict customer income based on occupation, product and various
other variables. In this case, we are predicting values for continuous variable.
1. Root Node: It represents entire population or sample and this further gets
divided into two or more homogeneous sets.
2. Splitting: It is a process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, then it is called
decision node.
These are the terms commonly used for decision trees. As we know that every algorithm
has advantages and disadvantages, below are the important factors which one should
know.
Advantages
1. Easy to Understand: Decision tree output is very easy to understand even for
people from non-analytical background. It does not require any statistical
knowledge to read and interpret them. Its graphical representation is very
intuitive and users can easily relate their hypothesis.
2. Useful in Data exploration: Decision tree is one of the fastest ways to identify the
most significant variables and the relation between two or more variables. With the
help of decision trees, we can create new variables / features that have better
power to predict the target variable. You can refer to the article (Trick to enhance power of
regression model) for one such trick. It can also be used in the data exploration
stage. For example, when we are working on a problem where we have information
available in hundreds of variables, a decision tree will help to identify the most
significant variables.
3. Less data cleaning required: It requires less data cleaning compared to
some other modeling techniques. It is fairly robust to outliers and missing
values.
4. Data type is not a constraint: It can handle both numerical and categorical
variables.
5. Non Parametric Method: Decision tree is considered to be a non-parametric
method. This means that decision trees have no assumptions about the space
distribution and the classifier structure.
Disadvantages
1. Over fitting: Over fitting is one of the most practical difficulties for decision tree
models. This problem gets solved by setting constraints on model parameters
and by pruning (discussed in detail below).
2. Not fit for continuous variables: While working with continuous numerical
variables, the decision tree loses information when it categorizes variables into
different buckets.
2. Regression Trees vs Classification Trees
We all know that the terminal nodes (or leaves) lie at the bottom of the decision tree.
This means that decision trees are typically drawn upside down, such that the leaves are at
the bottom and the roots are at the top (shown below).
Both types of trees work in an almost similar manner. Let's look at the primary differences and
similarities between classification and regression trees:
3. How does a tree decide where to split?
Decision trees use multiple algorithms to decide to split a node into two or more sub-
nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In
other words, we can say that the purity of the node increases with respect to the target
variable. The decision tree splits the nodes on all available variables and then selects the
split which results in the most homogeneous sub-nodes.
The algorithm selection is also based on the type of target variable. Let's look at the four
most commonly used algorithms in decision trees:
Gini Index
The Gini index says that if we select two items from a population at random, then they must be of the
same class, and the probability of this is 1 if the population is pure.
1. Calculate Gini for sub-nodes, using the formula: sum of squares of the probabilities of
success and failure (p^2 + q^2).
2. Calculate the Gini for the split using the weighted Gini score of each node of that split.
Example: – Referring to example used above, where we want to segregate the students
based on target variable ( playing cricket or not ). In the snapshot below, we split the
population using two input variables Gender and Class. Now, I want to identify which
split is producing more homogeneous sub-nodes using Gini index.
Split on Gender:
Above, you can see that the Gini score for the split on Gender is higher than for the split on
Class; hence, the node split will take place on Gender.
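As a quick numeric check of the calculation described above (a Python sketch using the student counts from the example, not part of the original article):

def gini_node(p_success):
    # Gini as used above: p^2 + q^2 (higher = purer node)
    q = 1 - p_success
    return p_success**2 + q**2

# (players, total) per sub-node from the student example
female, male = gini_node(2/10), gini_node(13/20)
class_ix, class_x = gini_node(6/14), gini_node(9/16)

gini_gender = (10/30)*female + (20/30)*male       # ~0.59
gini_class = (14/30)*class_ix + (16/30)*class_x   # ~0.51
print(gini_gender, gini_class)  # Gender gives the higher (purer) weighted score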
Chi-Square
It is an algorithm to find out the statistical significance of the differences between
sub-nodes and the parent node. We measure it by the sum of squares of the
standardized differences between observed and expected frequencies of the target variable.
1. Calculate the Chi-square for an individual node by calculating the deviation for both Success
and Failure.
2. Calculate the Chi-square of the split using the sum of all Chi-square values of Success and Failure
of each node of the split.
Example: Let’s work with above example that we have used to calculate Gini.
Split on Gender:
1. First, populate the Female node: the actual values for "Play Cricket" and "Not Play
Cricket" are 2 and 8 respectively.
2. Calculate the expected values for "Play Cricket" and "Not Play Cricket"; here it
would be 5 for both, because the parent node has a probability of 50% and we have
applied the same probability to the Female count (10).
3. Calculate the deviations using the formula Actual – Expected. For "Play
Cricket" it is (2 – 5 = -3) and for "Not Play Cricket" it is (8 – 5 = 3).
4. Calculate the Chi-square of the node for "Play Cricket" and "Not Play Cricket" using
the formula ((Actual – Expected)^2 / Expected)^(1/2). You can refer to the
table below for the calculation.
5. Follow similar steps for calculating the Chi-square value for the Male node.
6. Now add all the Chi-square values to calculate the Chi-square for the split on Gender.
Split on Class:
Perform similar steps of calculation for the split on Class and you will come up with the table
below.
Above, you can see that Chi-square also identifies the Gender split as more significant
compared to Class.
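The same table can be reproduced with a short Python sketch (shown only as a check of the arithmetic above; the 50/50 expected split is taken from the parent node):

import math

def node_chi2(actual_play, actual_not):
    # Expected counts assume the parent's 50/50 split of play vs not play
    expected = (actual_play + actual_not) / 2
    chi = lambda actual: math.sqrt((actual - expected)**2 / expected)
    return chi(actual_play) + chi(actual_not)

chi_gender = node_chi2(2, 8) + node_chi2(13, 7)   # Female + Male   ~ 4.58
chi_class = node_chi2(6, 8) + node_chi2(9, 7)     # Class IX + X    ~ 1.46
print(chi_gender, chi_class)  # the Gender split is far more significant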
Information Gain:
Look at the image below and think which node can be described easily. I am sure, your
answer is C because it requires less information as all values are similar. On the other
hand, B requires more information to describe it and A requires the maximum
information. In other words, we can say that C is a Pure node, B is less Impure and A is
more impure.
Now, we can conclude that a less impure node requires less information to
describe it, and a more impure node requires more information. Information theory has a
measure to define this degree of disorganization in a system, known as Entropy. If the
sample is completely homogeneous, then the entropy is zero, and if the sample is
equally divided (50% – 50%), it has an entropy of one.
For a binary outcome, Entropy = -p*log2(p) - q*log2(q), where p and q are the probabilities of
success and failure respectively in that node. Entropy is also used with a categorical target
variable. The algorithm chooses the split which has the lowest entropy compared to the parent node
and other splits. The lesser the entropy, the better it is.
Example: Let’s use this method to identify best split for student example.
1. Entropy for parent node = -(15/30) log2 (15/30) – (15/30) log2 (15/30) = 1. Here 1
shows that it is an impure node.
2. Entropy for Female node = -(2/10) log2 (2/10) – (8/10) log2 (8/10) = 0.72 and for
male node, -(13/20) log2 (13/20) – (7/20) log2 (7/20) = 0.93
3. Entropy for split Gender = Weighted entropy of sub-nodes = (10/30)*0.72 +
(20/30)*0.93 = 0.86
4. Entropy for Class IX node, -(6/14) log2 (6/14) – (8/14) log2 (8/14) = 0.99 and for
Class X node, -(9/16) log2 (9/16) – (7/16) log2 (7/16) = 0.99.
5. Entropy for split Class = (14/30)*0.99 + (16/30)*0.99 = 0.99
Above, you can see that entropy for Split on Gender is the lowest among all, so the tree
will split on Gender. We can derive information gain from entropy as 1- Entropy.
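These entropy numbers can be verified with a short Python sketch (a check of the arithmetic above, not part of the original article):

import math

def entropy(p):
    # Binary entropy of a node with success probability p
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)

e_gender = (10/30)*entropy(2/10) + (20/30)*entropy(13/20)   # ~0.86
e_class = (14/30)*entropy(6/14) + (16/30)*entropy(9/16)     # ~0.99
print(e_gender, e_class)  # Gender gives the lower weighted entropy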
Reduction in Variance
Till now, we have discussed the algorithms for categorical target variable. Reduction in
variance is an algorithm used for continuous target variables (regression problems). This
algorithm uses the standard formula of variance to choose the best split. The split with the
lower weighted variance is selected as the criterion to split the population:
Example:- Let’s assign numerical value 1 for play cricket and 0 for not playing cricket.
Now follow the steps to identify the right split:
1. Variance for Root node, here mean value is (15*1 + 15*0)/30 = 0.5 and we have
15 one and 15 zero. Now variance would be ((1-0.5)^2+(1-0.5)^2+….15 times+(0-
0.5)^2+(0-0.5)^2+…15 times) / 30, this can be written as (15*(1-0.5)^2+15*(0-
0.5)^2) / 30 = 0.25
2. Mean of Female node = (2*1+8*0)/10=0.2 and Variance = (2*(1-0.2)^2+8*(0-
0.2)^2) / 10 = 0.16
3. Mean of Male Node = (13*1+7*0)/20=0.65 and Variance = (13*(1-0.65)^2+7*(0-
0.65)^2) / 20 = 0.23
4. Variance for Split Gender = Weighted Variance of Sub-nodes = (10/30)*0.16 +
(20/30) *0.23 = 0.21
5. Mean of Class IX node = (6*1+8*0)/14=0.43 and Variance = (6*(1-0.43)^2+8*(0-
0.43)^2) / 14= 0.24
6. Mean of Class X node = (9*1+7*0)/16=0.56 and Variance = (9*(1-0.56)^2+7*(0-
0.56)^2) / 16 = 0.25
7. Variance for Split Class = (14/30)*0.24 + (16/30)*0.25 = 0.25
Above, you can see that the Gender split has a lower variance compared to the parent node and to
the Class split, so the split would take place on the Gender variable.
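A compact Python sketch of the same computation (again, only a check of the worked numbers above):

def node_variance(n_play, n_not):
    # Assign 1 to "plays cricket" and 0 otherwise, then compute the node variance
    n = n_play + n_not
    mean = n_play / n
    return (n_play * (1 - mean)**2 + n_not * (0 - mean)**2) / n

var_gender = (10/30)*node_variance(2, 8) + (20/30)*node_variance(13, 7)   # ~0.21
var_class = (14/30)*node_variance(6, 8) + (16/30)*node_variance(9, 7)     # ~0.25
print(var_gender, var_class)  # the Gender split reduces variance the most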
Until here, we learnt about the basics of decision trees and the decision making process
involved to choose the best splits in building a tree model. As I said, decision tree can be
applied both on regression and classification problems. Let’s understand these aspects
in detail.
4. What are the key parameters of tree modeling and
how can we avoid over-fitting in decision trees?
Overfitting is one of the key challenges faced while modeling decision trees. If there is
no limit set on a decision tree, it will give you 100% accuracy on the training set because, in
the worst case, it will end up making one leaf for each observation. Thus, preventing
overfitting is pivotal while modeling a decision tree, and it can be done in 2 ways:
The parameters used for defining a tree are further explained below. The parameters
described below are irrespective of tool. It is important to understand the role of
parameters used in tree modeling. These parameters are available in R & Python.
Tree Pruning
As discussed earlier, the technique of setting constraint is a greedy-approach. In other
words, it will check for the best split instantaneously and move forward until one of the
specified stopping condition is reached. Let’s consider the following case when you’re
driving:
At this instant, you are the yellow car and you have 2 choices:
This is exactly the difference between normal decision tree & pruning. A decision tree
with constraints won’t see the truck ahead and adopt a greedy approach by taking a left.
On the other hand if we use pruning, we in effect look at a few steps ahead and make a
choice.
So we know pruning is better. But how to implement it in decision tree? The idea is
simple.
Actually, you can use any algorithm. It is dependent on the type of problem you are
solving. Let’s look at some key factors which will help you to decide which algorithm to
use:
For R users, there are multiple packages available to implement decision tree such as
ctree, rpart, tree etc.
> library(rpart)
> x <- cbind(x_train, y_train)
# grow tree
> fit <- rpart(y_train ~ ., data = x, method = "class")
> summary(fit)
#Predict Output
> predicted <- predict(fit, x_test)
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have, X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini')
# for classification, you can change the criterion to gini or entropy (information gain); by default it is gini
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)
7. What are ensemble methods in tree based
modeling ?
The literal meaning of the word 'ensemble' is group. Ensemble methods involve a group of
predictive models to achieve better accuracy and model stability. Ensemble methods
are known to impart a supreme boost to tree based models.
Like every other model, a tree based model also suffers from the plague of bias and
variance. Bias means, ‘how much on an average are the predicted values different from
the actual value.’ Variance means, ‘how different will the predictions of the model be at
the same point if different samples are taken from the same population’.
You build a small tree and you will get a model with low variance and high bias. How do
you manage to balance the trade off between bias and variance ?
Normally, as you increase the complexity of your model, you will see a reduction in
prediction error due to lower bias in the model. As you continue to make your model
more complex, you end up over-fitting your model and your model will start suffering
from high variance.
A champion model should maintain a balance between these two types of errors. This is
known as the trade-off management of bias-variance errors. Ensemble learning is one
way to execute this trade off analysis.
Some of the commonly used ensemble methods include Bagging, Boosting and Stacking. In
this tutorial, we'll focus on Bagging and Boosting in detail.
8. What is Bagging? How does it work?
Bagging is a technique used to reduce the variance of our predictions by combining the
result of multiple classifiers modeled on different sub-samples of the same data set. The
following figure will make it clearer:
1. Assume number of cases in the training set is N. Then, sample of these N cases
is taken at random but with replacement. This sample will be the training set for
growing the tree.
2. If there are M input variables, a number m<M is specified such that at each node,
m variables are selected at random out of the M. The best split on these m is
used to split the node. The value of m is held constant while we grow the forest.
3. Each tree is grown to the largest extent possible and there is no pruning.
4. Predict new data by aggregating the predictions of the ntree trees (i.e., majority
votes for classification, average for regression).
To understand more in detail about this algorithm using a case study, please read
this article “Introduction to Random forest – Simplified“.
It has an effective method for estimating missing data and maintains accuracy
when a large proportion of the data are missing.
It has methods for balancing errors in data sets where classes are imbalanced.
The capabilities of the above can be extended to unlabeled data, leading to
unsupervised clustering, data views and outlier detection.
Random Forest involves sampling of the input data with replacement called as
bootstrap sampling. Here one third of the data is not used for training and can be
used to testing. These are called the out of bag samples. Error estimated on
these out of bag samples is known as out of bag error. Study of error estimates
by Out of bag, gives evidence to show that the out-of-bag estimate is as accurate
as using a test set of the same size as the training set. Therefore, using the out-
of-bag error estimate removes the need for a set aside test set.
#Import Library
from sklearn.ensemble import RandomForestClassifier # use RandomForestRegressor for a regression problem
#Assumed you have, X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create Random Forest object
model = RandomForestClassifier(n_estimators=1000)
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
R Code
> library(randomForest)
# Fitting model
> fit <- randomForest(Species ~ ., x,ntree=500)
> summary(fit)
#Predict Output
> predicted <- predict(fit, x_test)
How would you classify an email as SPAM or not? Like everyone else, our initial
approach would be to identify ‘spam’ and ‘not spam’ emails using following criteria. If:
Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But, do
you think these rules individually are strong enough to successfully classify an email?
No.
Individually, these rules are not powerful enough to classify an email into 'spam' or 'not
spam'. Therefore, these rules are called weak learners.
To convert weak learner to strong learner, we’ll combine the prediction of each weak
learner using methods like:
To find weak rule, we apply base learning (ML) algorithms with a different distribution.
Each time base learning algorithm is applied, it generates a new weak prediction rule.
This is an iterative process. After many iterations, the boosting algorithm combines these
weak rules into a single strong prediction rule.
Here’s another question which might haunt you, ‘How do we choose different distribution
for each round?’
For choosing the right distribution, here are the steps:
Step 1: The base learner takes all the distributions and assigns equal weight or attention
to each observation.
Step 2: If there is any prediction error caused by the first base learning algorithm, then we
pay higher attention to the observations having prediction errors. Then, we apply the next
base learning algorithm.
Step 3: Iterate Step 2 till the limit of the base learning algorithm is reached or higher
accuracy is achieved.
Finally, it combines the outputs from the weak learners and creates a strong learner which
eventually improves the prediction power of the model. Boosting pays higher focus on
examples which are mis-classified or have higher errors due to the preceding weak rules.
There are many boosting algorithms which impart additional boost to model’s accuracy.
In this tutorial, we’ll learn about the two most commonly used algorithms i.e. Gradient
Boosting (GBM) and XGboost.
12. Working with GBM in R and Python
Before we start working, let’s quickly understand the important parameters and the
working of this algorithm. This will be helpful for both R and Python users. Below is the
overall pseudo-code of GBM algorithm for 2 classes:
1. Initialize the outcome
2. Iterate from 1 to the total number of trees
  2.1 Update the weights for targets based on the previous run (higher for the ones mis-classified)
  2.2 Fit the model on a selected sub-sample of data
  2.3 Make predictions on the full set of observations
  2.4 Update the output with the current results, taking into account the learning rate
3. Return the final output
This is an extremely simplified (probably naive) explanation of GBM's working. But it will
help every beginner understand this algorithm.
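A toy Python sketch of this loop is given below. It is not the exact GBM algorithm, just a squared-error boosting loop under assumed inputs X and y (hypothetical names), to make steps 1, 2 and 2.4 concrete:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gbm(X, y, n_estimators=100, learning_rate=0.1):
    """Toy gradient boosting for squared error: each tree fits the residuals."""
    prediction = np.full(len(y), y.mean())   # step 1: initial estimate
    trees = []
    for _ in range(n_estimators):            # step 2: sequential trees
        residuals = y - prediction           # errors from the previous run
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)   # step 2.4: damped update
        trees.append(tree)
    return trees

# Usage (X, y assumed): trees = simple_gbm(X, y); new predictions are built by
# summing learning_rate * tree.predict(X_new) over the trees, starting from y.mean().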
1. learning_rate
o This determines the impact of each tree on the final outcome (step 2.4).
GBM works by starting with an initial estimate which is updated using the
output of each tree. The learning parameter controls the magnitude of this
change in the estimates.
o Lower values are generally preferred as they make the model robust to
the specific characteristics of each tree, thus allowing it to generalize well.
o Lower values would require higher number of trees to model all the
relations and will be computationally expensive.
2. n_estimators
o The number of sequential trees to be modeled (step 2)
o Though GBM is fairly robust to a higher number of trees, it can still overfit
at a point. Hence, this should be tuned using CV for a particular learning
rate.
3. subsample
o The fraction of observations to be selected for each tree. Selection is done
by random sampling.
o Values slightly less than 1 make the model robust by reducing the
variance.
o Typical values ~0.8 generally work fine but can be fine-tuned further.
Apart from these, there are certain miscellaneous parameters which affect overall
functionality:
1. loss
o It refers to the loss function to be minimized in each split.
o It can have various values for classification and regression case.
Generally the default values work fine. Other values should be chosen
only if you understand their impact on the model.
2. init
o This affects initialization of the output.
o This can be used if we have made another model whose outcome is to be
used as the initial estimates for GBM.
3. random_state
o The random number seed so that same random numbers are generated
every time.
o This is important for parameter tuning. If we don’t fix the random number,
then we’ll have different outcomes for subsequent runs on the same
parameters and it becomes difficult to compare models.
o It can potentially result in overfitting to a particular random sample
selected. We can try running models for different random samples, which
is computationally expensive and generally not used.
4. verbose
o The type of output to be printed when the model fits. The different values
can be:
0: no output generated (default)
1: output generated for trees in certain intervals
>1: output generated for all trees
5. warm_start
o This parameter has an interesting application and can help a lot if used
judiciously.
o Using this, we can fit additional trees on previous fits of a model. It can
save a lot of time and you should explore this option for advanced
applications
6. presort
o Select whether to presort data for faster splits.
o It makes the selection automatically by default but it can be changed if
needed.
I know its a long list of parameters but I have simplified it for you in an excel file which
you can download from this GitHub repository.
For R users, using caret package, there are 3 main tuning parameters:
1. n.trees – It refers to the number of iterations, i.e. the number of trees that will be
grown.
2. interaction.depth – It determines the complexity of the tree i.e. total number of
splits it has to perform on a tree (starting from a single node)
3. shrinkage – It refers to the learning rate. This is similar to learning_rate in python
(shown above).
4. n.minobsinnode – It refers to minimum number of training samples required in a
node to perform splitting
> library(caret)
# cross-validation control and tuning grid (object and data names around the surviving lines are assumed)
> fitControl <- trainControl(method = "cv", number = 10)
> gbmGrid <- expand.grid(interaction.depth = 2,
                         n.trees = 500,
                         shrinkage = 0.1,
                         n.minobsinnode = 10)
> set.seed(825)
> fit <- train(y_train ~ ., data = train,
               method = "gbm",
               trControl = fitControl,
               verbose = FALSE,
               tuneGrid = gbmGrid)
GBM in Python
#import libraries
from sklearn.ensemble import GradientBoostingClassifier # use GradientBoostingRegressor for regression
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)  # illustrative parameter values
clf.fit(X_train, y_train)
Python Tutorial: For Python users, this is a comprehensive tutorial on XGBoost, good to
get you started. Check Tutorial.
Till here, you've gained significant knowledge of tree based models along with their
practical implementation. It's time that you start working on them. Here are open practice
problems where you can participate and check your live rankings on the leaderboard:
End Notes
Tree based algorithms are important for every data scientist to learn. In fact, tree models
are known to provide the best model performance in the family of whole machine
learning algorithms. In this tutorial, we learnt up to GBM and XGBoost. And with this, we
come to the end of this tutorial.
We discussed tree based modeling from scratch. We learnt the importance of
decision trees and how that simplistic concept is being used in boosting algorithms. For
better understanding, I would suggest you continue practicing these algorithms
practically. Also, do keep note of the parameters associated with boosting algorithms.
I'm hoping that this tutorial would enrich you with complete knowledge on tree based
modeling.
Ultimate guide to deal with Text Data
(using Python) – for Data Scientists &
Engineers
Introduction
One of the biggest breakthroughs required for achieving any level of artificial intelligence
is to have machines which can process text data. Thankfully, the amount of text data
being generated in this universe has exploded exponentially in the last few years.
In this article we will discuss different feature extraction methods, starting with some
basic techniques which will lead into advanced Natural Language Processing
techniques. We will also learn about pre-processing of the text data in order to extract
better features from clean data.
By the end of this article, you will be able to perform text operations by yourself. Let’s get
started!
Table of Contents:
1. Basic feature extraction using text data
o Number of words
o Number of characters
o Average word length
o Number of stopwords
o Number of special characters
o Number of numerics
o Number of uppercase words
2. Basic Text Pre-processing of text data
o Lower casing
o Punctuation removal
o Stopwords removal
o Frequent words removal
o Rare words removal
o Spelling correction
o Tokenization
o Stemming
o Lemmatization
3. Advance Text Processing
o N-grams
o Term Frequency
o Inverse Document Frequency
o Term Frequency-Inverse Document Frequency (TF-IDF)
o Bag of Words
o Sentiment Analysis
o Word Embedding
Before starting, let’s quickly read the training file from the dataset in order to perform
different tasks on it. In the entire article, we will use the twitter sentiment dataset from
the datahack platform.
train = pd.read_csv('train_E6oV3lV.csv')
Note that here we are only working with textual data, but we can also use the below
methods when numerical features are also present along with the text.
train['word_count'] = train['tweet'].apply(lambda x: len(str(x).split(" ")))
train[['tweet','word_count']].head()
train['char_count'] = train['tweet'].str.len()
train[['tweet','char_count']].head()
Note that the calculation will also include the number of spaces, which you can remove,
if required.
1.3 Average Word Length
We will also extract another feature which will calculate the average word length of each
tweet. This can also potentially help us in improving our model.
Here, we simply take the sum of the length of all the words and divide it by the total
length of the tweet:
def avg_word(sentence):
  words = sentence.split()
  return (sum(len(word) for word in words) / len(words))

train['avg_word'] = train['tweet'].apply(lambda x: avg_word(x))
train[['tweet','avg_word']].head()
Here, we have imported stopwords from NLTK, which is a basic NLP library in python.
from nltk.corpus import stopwords
stop = stopwords.words('english')
train['stopwords'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))
train[['tweet','stopwords']].head()
Here, we make use of the ‘starts with’ function because hashtags (or mentions) always
appear at the beginning of a word.
train['hastags'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
train[['tweet','hastags']].head()
train['numerics'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
train[['tweet','numerics']].head()
train['upper'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
train[['tweet','upper']].head()
2. Basic Pre-processing
So far, we have learned how to extract basic features from text data. Before diving into
text and feature extraction, our first step should be cleaning the data in order to obtain
better features. We will achieve this by doing some of the basic pre-processing steps on
our training data.
2.1 Lower case
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
train['tweet'].head()
2.2 Removing Punctuation
The next step is to remove punctuation, as it doesn’t add any extra information while
treating text data. Therefore removing all instances of it will help us reduce the size of
the training data.
train['tweet'] = train['tweet'].str.replace('[^\w\s]','')
train['tweet'].head()
As you can see in the above output, all the punctuation, including ‘#’ and ‘@’, has been
removed from the training data.
2.3 Removal of Stop Words
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
train['tweet'].head()
2.4 Frequent words removal
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq
love 2647
ð 2511
day 2199
â 1797
happy 1663
amp 1582
im 1139
u 1136
time 1110
dtype: int64
Now, let’s remove these words as their presence will not of any use in classification of
our text data.
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()
2.5 Rare words removal
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]
freq
tvperfect 1
oau 1
850am 1
semangatpagi 1
kindestbravest 1
moodyah 1
downhill 1
loreal 1
ohwhatcoulditbe 1
maannnn 1
dtype: int64
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()
All these pre-processing steps are essential and help us in reducing our vocabulary
clutter so that the features produced in the end are more effective.
In that regard, spelling correction is a useful pre-processing step because this also will
help us in reducing multiple copies of words. For example, “Analytics” and “analytcs” will
be treated as different words even if they are used in the same sense.
To achieve this we will use the textblob library. If you are not familiar with it, you can
check my previous article on ‘NLP for beginners using textblob’.
from textblob import TextBlob
train['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))
Note that it will actually take a lot of time to make these corrections. Therefore, just for
the purposes of learning, I have shown this technique by applying it on only the first 5
rows. Moreover, we cannot always expect it to be accurate so some care should be
taken before applying it.
We should also keep in mind that words are often used in their abbreviated form. For
instance, ‘your’ is used as ‘ur’. We should treat this before the spelling correction step,
otherwise these words might be transformed into any other word like the one shown
below:
2.7 Tokenization
Tokenization refers to dividing the text into a sequence of words or sentences. In our
example, we have used the textblob library to first transform our tweets into a blob and
then converted them into a series of words.
TextBlob(train['tweet'][1]).words
2.8 Stemming
Stemming refers to the removal of suffices, like “ing”, “ly”, “s”, etc. by a simple rule-based
approach. For this purpose, we will use PorterStemmer from the NLTK library.
from nltk.stem import PorterStemmer
st = PorterStemmer()
train['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
0 father dysfunct selfish drag kid dysfunct run
2 bihday majesti
2.9 Lemmatization
Lemmatization is a more effective option than stemming because it converts the word
into its root word, rather than just stripping the suffices. It makes use of the vocabulary
and does a morphological analysis to obtain the root word. Therefore, we usually prefer
using lemmatization over stemming.
from textblob import Word
train['tweet'] = train['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train['tweet'].head()
3.1 N-grams
N-grams are the combination of multiple words used together. Ngrams with N=1 are
called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on can also be used.
So, let’s quickly extract bigrams from our tweets using the ngrams function of the
textblob library.
TextBlob(train['tweet'][0]).ngrams(2)
[WordList(['when', 'a']),
WordList(['a', 'father']),
WordList(['father', 'is']),
WordList(['is', 'dysfunctional']),
WordList(['dysfunctional', 'and']),
WordList(['and', 'is']),
WordList(['is', 'so']),
WordList(['so', 'selfish']),
WordList(['selfish', 'he']),
WordList(['he', 'drags']),
WordList(['drags', 'his']),
WordList(['his', 'kids']),
WordList(['kids', 'into']),
WordList(['into', 'his']),
WordList(['his', 'dysfunction']),
WordList(['dysfunction', 'run'])]
3.2 Term frequency
Term frequency is simply the ratio of the count of a word present in a sentence, to the
length of the sentence.
Below, I have tried to show you the term frequency table of a tweet.
tf1 = (train['tweet'][1:2]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
tf1
Therefore, the IDF of each word is the log of the ratio of the total number of rows to the
number of rows in which that word is present.
IDF = log(N/n), where, N is the total number of rows and n is the number of rows in
which the word was present.
So, let’s calculate IDF for the same tweets for which we calculated the term frequency.
for i, word in enumerate(tf1['words']):
  tf1.loc[i, 'idf'] = np.log(train.shape[0] / (len(train[train['tweet'].str.contains(word)])))
tf1
The higher the value of IDF, the more unique the word.
3.4 Term Frequency-Inverse Document Frequency (TF-IDF)
tf1['tfidf'] = tf1['tf'] * tf1['idf']
tf1
We can see that the TF-IDF has penalized words like ‘don’t’, ‘can’t’, and ‘use’ because
they are commonly occurring words. However, it has given a high weight to
“disappointed” since that will be very useful in determining the sentiment of the tweet.
We don’t have to calculate TF and IDF every time beforehand and then multiply it to
obtain TF-IDF. Instead, sklearn has a separate function to directly obtain it:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', stop_words='english', ngram_range=(1,1))
train_vect = tfidf.fit_transform(train['tweet'])
train_vect
We can also perform basic pre-processing steps like lower-casing and removal of
stopwords, if we haven’t done them earlier.
3.5 Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1), analyzer="word")
train_bow = bow.fit_transform(train['tweet'])
train_bow
3.6 Sentiment Analysis
If you recall, our problem was to detect the sentiment of the tweet. So, before applying
any ML/DL models (which can have a separate feature detecting the sentiment using the
textblob library), let’s check the sentiment of the first few tweets.
train['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)
0 (-0.3, 0.5354166666666667)
1 (0.2, 0.2)
2 (0.0, 0.0)
3 (0.0, 0.0)
4 (0.0, 0.0)
Above, you can see that it returns a tuple representing the polarity and subjectivity of each
tweet. Here, we only extract polarity, as it indicates the sentiment: a value nearer to 1
means a positive sentiment and a value nearer to -1 means a negative sentiment. This
can also work as a feature for building a machine learning model.
train['sentiment'] = train['tweet'].apply(lambda x: TextBlob(x).sentiment[0])
train[['tweet','sentiment']].head()
Word2Vec models require a lot of text, so either we can train it on our training data or we
can use the pre-trained word vectors developed by Google, Wiki, etc.
Here, we will use pre-trained word vectors which can be downloaded from
the glove website. There are different dimensions (50,100, 200, 300) vectors trained on
wiki data. For this example, I have downloaded the 100-dimensional version of the
model.
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
>(400000, 100)
Now, we can load the above word2vec file as a model.
from gensim.models import KeyedVectors
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
Let’s say our tweet contains a text saying ‘go away’. We can easily obtain it’s word
vector using the above model:
model['go']
model['away']
We then take the average to represent the string ‘go away’ in the form of vectors having
100 dimensions.
(model['go'] + model['away'])/2
We have converted the entire string into a vector which can now be used as a feature in
any modelling technique.
End Notes
I hope that now you have a basic understanding of how to deal with text data in
predictive modeling. These methods will help in extracting more information, which in
turn will help you in building better models.
Topics Covered
1. Type of Recommendation Engines
2. The MovieLens DataSet
3. A simple popularity model
4. A Collaborative Filtering Model
5. Evaluating Recommendation Engines
Basically, the most popular items would be the same for each user, since popularity is defined
on the entire user pool, so everybody will see the same results. It sounds like 'a website
recommends you buy a microwave just because it's been liked by other users and
doesn't care whether you are even interested in buying it or not'.
Surprisingly, such an approach still works in places like news portals. Whenever you log in
to, say, BBC News, you'll see a column of "Popular News" which is subdivided into sections,
and the most-read articles of each section are displayed. This approach can work in
this case because:
There is division by section, so the user can look at the section of his interest.
At any time there are only a few hot topics, and there is a high chance that a user
wants to read the news which is being read by most others.
Incorporates personalization
It can work even if the user’s past history is short or not available
But has some major drawbacks as well because of which it is not used much in practice:
The features might actually not be available or even if they are, they may not be
sufficient to make a good classifier
As the number of users and items grow, making a good classifier will become
exponentially difficult
1. Content based algorithms:
o Idea: If you like an item, then you will also like a “similar” item.
o Based on the similarity of the items being recommended.
o It generally works well when it is easy to determine the context/properties of
each item, for instance when we are recommending the same kind of
item, such as movie or song recommendations.
2. Collaborative filtering algorithms:
o Idea: If a person A likes items 1, 2, 3 and B likes items 2, 3, 4, then they have similar
interests, so A should like item 4 and B should like item 1.
o This algorithm is entirely based on past behaviour and not on the
context. This makes it one of the most commonly used algorithms, as it
does not depend on any additional information.
o For instance: product recommendations by e-commerce players like
Amazon and merchant recommendations by banks like American
Express.
o Further, there are several types of collaborative filtering algorithms:
1. User-User Collaborative filtering: Here we find look-alike
customers (based on similarity) and offer products which the first
customer’s look-alikes have chosen in the past. This algorithm is very
effective but takes a lot of time and resources, since it requires
computing information for every pair of customers.
Therefore, for platforms with a big user base, this algorithm is hard to
implement without a very strong parallelizable system.
2. Item-Item Collaborative filtering: It is quite similar to the
previous algorithm, but instead of finding customer look-alikes, we
try to find item look-alikes. Once we have the item look-alike matrix, we
can easily recommend similar items to customers who have
purchased any item from the store. This algorithm is far less
resource-consuming than user-user collaborative filtering: for a new
customer it takes far less time, as we don’t need all the similarity
scores between customers, and with a fixed number of products the
product-product look-alike matrix is fixed over time. (A minimal sketch
of this idea follows this list.)
3. Other simpler algorithms: There are other approaches,
like market basket analysis, which generally do not have as high
predictive power as the algorithms described above.
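To make the item-item idea concrete, here is a minimal, self-contained sketch (on a toy ratings table with hypothetical values; the column names simply mirror the MovieLens data used below) that builds an item-item cosine similarity matrix and scores items for one user:
import numpy as np
import pandas as pd

# toy ratings table: (user_id, movie_id, rating)
ratings_toy = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 3, 3],
    'movie_id': [10, 20, 30, 20, 30, 10, 30],
    'rating':   [5, 4, 1, 5, 2, 4, 5],
})

# pivot to a user x item matrix, filling missing ratings with 0
R = ratings_toy.pivot_table(index='user_id', columns='movie_id', values='rating').fillna(0)

# cosine similarity between item columns gives the item-item "look alike" matrix
norms = np.linalg.norm(R.values, axis=0)
item_sim = R.values.T.dot(R.values) / np.outer(norms, norms)

# score items for user 1 by weighting each item by its similarity to the items he rated
scores = item_sim.dot(R.loc[1].values)
print(pd.Series(scores, index=R.columns).sort_values(ascending=False))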
Let’s load this data into Python. There are many files in the ml-100k.zip archive which we
can use. Let’s load the three most important files to get a sense of the data. I also
recommend that you read the readme document, which gives a lot of information about the
different files.
import pandas as pd

# pass in column names for each CSV and read them using pandas
# (file and column names follow the MovieLens 100K readme)

#Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols, encoding='latin-1')

#Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, encoding='latin-1')

#Reading items file:
i_cols = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown',
'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama',
'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
'War', 'Western']
items = pd.read_csv('ml-100k/u.item', sep='|', names=i_cols, encoding='latin-1')
Now let’s take a peek into the content of each file to understand them better.
Users
print users.shape
users.head()
This reconfirms that there are 943 users and we have 5 features for each, namely their
unique ID, age, gender, occupation and the zip code they live in.
Ratings
print ratings.shape
ratings.head()
This confirms that there are 100K ratings for different user and movie combinations. Also
notice that each rating has a timestamp associated with it.
Items
print items.shape
items.head()
This dataset contains attributes of the 1682 movies. There are 24 columns, of which the
last 19 specify the genres of a particular movie: a value of 1 denotes that the movie
belongs to that genre, and 0 otherwise.
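To make this concrete, here is a small sketch (assuming the column layout read in above) that decodes the one-hot genre columns of the first movie back into a list of genre names:
genre_cols = items.columns[5:]   # the last 19 columns are the genre indicators
first_movie = items.iloc[0]
print([g for g in genre_cols if first_movie[g] == 1])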
Now we have to divide the ratings data set into test and train data for making models.
Luckily, GroupLens provides pre-divided data wherein the test data has 10 ratings for
each user, i.e. 9430 rows in total. Let’s load that:
ratings_base = pd.read_csv('ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')
ratings_base.shape, ratings_test.shape
import graphlab
train_data = graphlab.SFrame(ratings_base)
test_data = graphlab.SFrame(ratings_test)
We can use this data for training and testing; with that, we have gathered all the data
available. Note that here we have user behaviour as well as attributes of the users and
movies, so we can build content-based as well as collaborative filtering algorithms.
popularity_model = graphlab.popularity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating')
Arguments: train_data is the training SFrame, user_id and item_id name the user and movie columns, and target is the rating column to be modelled.
Let’s use this model to make top-5 recommendations for the first 5 users and see what
comes out:
#Get recommendations for first 5 users and print them
popularity_recomm = popularity_model.recommend(users=range(1,6),k=5)
popularity_recomm.print_rows(num_rows=25)
Did you notice something? The recommendations for all users are the same:
1500, 1201, 1189, 1122, 814, in the same order. This can be verified by checking the
movies with the highest mean rating in our ratings_base data set:
ratings_base.groupby(by='movie_id')
['rating'].mean().sort_values(ascending=False).head(20)
This confirms that all the recommended movies have an average rating of 5, i.e. all the
users who watched these movies gave them a top rating. Thus we can see that our popularity
system works as expected. But is it good enough? We’ll analyze it in detail later.
4. A Collaborative Filtering Model
Let’s start by understanding the basics of a collaborative filtering algorithm. The core idea
works in 2 steps:
1. Find how similar every pair of items is, based on the users who rated both of them.
2. For each user, recommend the items most similar to the ones he has already rated highly.
To give you a high-level overview, this is done by making an item-item matrix in which
we keep a record of the pairs of items which were rated together.
In this case, an item is a movie. Once we have the matrix, we use it to determine the
best recommendations for a user based on the movies he has already rated. Note that
there are a few more things to take care of in an actual implementation which would require
deeper mathematical introspection, which I’ll skip for now.
I would just like to mention that there are 3 types of item similarity metrics supported by
graphlab. These are:
1. Jaccard Similarity:
o Similarity is based on the number of users who have rated both items A and B,
divided by the number of users who have rated either A or B
o It is typically used where we don’t have a numeric rating but just a boolean
value, like a product being bought or an ad being clicked
2. Cosine Similarity:
o Similarity is the cosine of the angle between the rating vectors of items
A and B
o The closer the vectors, the smaller the angle and the larger the cosine
3. Pearson Similarity
o Similarity is the Pearson correlation coefficient between the two vectors.
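As a quick illustration of how these three metrics differ, here is a small sketch on two hypothetical item rating vectors (not taken from the MovieLens data), where 0 means the user did not rate the item:
import numpy as np
from scipy.stats import pearsonr

a = np.array([5, 3, 0, 4, 0])   # ratings of item A by five users
b = np.array([4, 0, 0, 5, 1])   # ratings of item B by the same users

# Jaccard: users who rated both, divided by users who rated either
rated_a, rated_b = a > 0, b > 0
jaccard = float((rated_a & rated_b).sum()) / (rated_a | rated_b).sum()

# Cosine: cosine of the angle between the two rating vectors
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pearson: correlation coefficient between the two rating vectors
pearson = pearsonr(a, b)[0]

print(jaccard, cosine, pearson)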
#Train Model
item_sim_model = graphlab.item_similarity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating')
#Make Recommendations:
item_sim_recomm = item_sim_model.recommend(users=range(1,6),k=5)
item_sim_recomm.print_rows(num_rows=25)
Here we can see that the recommendations are different for each user, so
personalization exists. But how good is this model? We need some means of evaluating
a recommendation engine. Let’s focus on that in the next section.
5. Evaluating Recommendation Engines
For evaluating recommendation engines, we can use the concept of precision-recall.
You must be familiar with this in terms of classification and the idea is very similar. Let
me define them in terms of recommendations.
Recall:
o What fraction of the items that a user likes were actually recommended?
o If a user likes, say, 5 items and the recommendation engine showed 3 of
them, then the recall is 0.6
Precision:
o Out of all the recommended items, how many did the user actually like?
o If 5 items were recommended to the user, out of which he liked, say, 4 of
them, then the precision is 0.8
Now, if we think about recall, how can we maximize it? If we simply recommend all the
items, they will definitely cover the items which the user likes. So we have 100% recall!
But think about precision for a second. If we recommend, say, 1000 items and the user likes
only, say, 10 of them, then the precision is just 1%. This is really low. Our aim is to maximize
both precision and recall.
An ideal recommender system is one which only recommends the items which the user
likes. So in this case precision = recall = 1. This is an optimal recommender, and we should
try to get as close to it as possible.
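Here is a minimal sketch (with hypothetical item IDs, not tied to GraphLab) of how precision and recall would be computed for one user’s recommendation list:
def precision_recall(recommended, liked):
    # items that were both recommended and liked
    hits = len(set(recommended) & set(liked))
    precision = float(hits) / len(recommended) if recommended else 0.0
    recall = float(hits) / len(liked) if liked else 0.0
    return precision, recall

# the user likes 5 items; 3 of the 5 recommended items are among them
print(precision_recall([10, 20, 30, 40, 50], [10, 20, 30, 60, 70]))   # (0.6, 0.6)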
Lets compare both the models we have built till now based on precision-recall
characteristics:
model_performance = graphlab.compare(test_data, [popularity_model, item_sim_model])
graphlab.show_comparison(model_performance, [popularity_model, item_sim_model])
Here we can make 2 very quick observations:
1. The item similarity model is definitely better than the popularity model (by at least
10x).
2. On an absolute level, even the item similarity model appears to have poor
performance. It is far from being a useful recommendation system.
There is big scope for improvement here, but I leave it up to you to figure out how to
improve this further.
In the end, I would like to mention that along with GraphLab, you can also use some
other open-source Python packages like the following:
Crab
Surprise
Python Recsys
MRec
End Notes
In this article, we traversed the process of making a basic recommendation
engine in Python using GraphLab. We started by understanding the fundamentals of
recommendations. Then we went on to load the MovieLens 100K data set for the
purpose of experimentation.
Subsequently, we built a first model: a simple popularity model in which the most
popular movies were recommended to each user. Since this lacked personalization, we
built another model based on collaborative filtering and observed the impact of
personalization.
Getting Started with Audio Data
Analysis using Deep Learning (with
case study)
Introduction
When you get started with data science, you start simple. You go through simple
projects like the Loan Prediction problem or the Big Mart Sales Prediction. These problems
have structured data arranged neatly in a tabular format. In other words, you are spoon-fed
the hardest part of the data science pipeline.
In real life, data rarely comes that way: you first have to understand it, collect it from
various sources and arrange it in a format which is ready for processing. This is even more
difficult when the data is in an unstructured format such as image or audio, because you
would have to represent the image/audio data in a standard way for it to be useful for analysis.
Moreover, the tone of a voice and the body language of a person can reveal many more
features about them, because actions speak louder than words! So, in short, unstructured
data is complex, but processing it can reap easy rewards.
In this article, I intend to give an overview of audio/voice processing with a case study,
so that you get a hands-on introduction to solving audio processing problems.
Table of Contents
What do you mean by Audio data?
o Applications of Audio Processing
Data Handling in Audio domain
Let’s solve the UrbanSound challenge!
Intermission: Our first submission
Let’s solve the challenge! Part 2: Building better models
Future Steps to explore
So can you somehow capture this audio floating all around you and do something
constructive with it? Yes, of course! There are devices which help you capture these sounds
and represent them in a computer-readable format. Examples of these formats are
wav (Waveform Audio File), mp3 (MPEG-1 Audio Layer 3) and WMA (Windows Media Audio).
If you think about what audio looks like, it is nothing but a wave-like form of
data, where the amplitude of the audio changes with respect to time. This can be pictorially
represented as follows.
Here’s an exercise for you; can you think of an application of audio processing that can
potentially help thousands of lives?
The first step is to actually load the data into a machine-understandable format. For this,
we simply take values after specific time steps. For example, in a 2-second audio
file, we might extract a value every half second. This is called sampling of audio data, and the
rate at which it is sampled is called the sampling rate.
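As a toy illustration (not taken from the dataset), the snippet below “samples” a 2-second, 5 Hz sine wave at two different sampling rates to show how many values end up being stored:
import numpy as np

duration = 2.0                                  # seconds of audio
for sampling_rate in (8, 1000):                 # samples per second
    t = np.arange(0, duration, 1.0 / sampling_rate)
    samples = np.sin(2 * np.pi * 5 * t)         # amplitude value at each time step
    print("%d Hz -> %d samples" % (sampling_rate, len(samples)))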
Another way of representing audio data is to convert it into a different domain of data
representation, namely the frequency domain. When we sample audio data in the time
domain, we require many more data points to represent the whole signal, and the sampling
rate should be as high as possible.
On the other hand, if we represent audio data in the frequency domain, much less
computational space is required. To get an intuition, take a look at the image below.
Here, we separate one audio signal into 3 different pure signals, which can now be
represented as three unique values in the frequency domain.
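A small numpy sketch of this idea (with a made-up composite signal, not the image above): three pure tones are mixed together, and a Fourier transform recovers their frequencies:
import numpy as np

sampling_rate = 1000
t = np.arange(0, 1, 1.0 / sampling_rate)

# composite signal made of three pure tones at 5, 50 and 120 Hz
signal = np.sin(2*np.pi*5*t) + 0.5*np.sin(2*np.pi*50*t) + 0.2*np.sin(2*np.pi*120*t)

# move to the frequency domain with the FFT
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sampling_rate)

# the three largest peaks sit at the three pure-tone frequencies
print(np.sort(freqs[np.argsort(spectrum)[-3:]]))   # [5. 50. 120.]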
There are a few more ways in which audio data can be represented, for example using
MFCCs (Mel-Frequency Cepstral Coefficients; we will cover these in a later article). These
are simply different ways to represent the data.
The next step is to extract features from these audio representations, so that our
algorithm can work on them and perform the task it is designed for. Here’s a
visual representation of the categories of audio features that can be extracted.
After extracting these features, they are then sent to the machine learning model for further
analysis.
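As a small sketch of such feature extraction (using the librosa library introduced below; the file path is an assumption based on the dataset layout used later):
import librosa
import numpy as np

y, sr = librosa.load('../data/Train/2022.wav')

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)        # cepstral features
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # spectral "brightness"
zcr = librosa.feature.zero_crossing_rate(y)                # simple time-domain feature

print(mfccs.shape, centroid.shape, zcr.shape)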
The dataset contains 8732 sound excerpts (<=4s) of urban sounds from 10 classes,
namely:
air conditioner,
car horn,
children playing,
dog bark,
drilling,
engine idling,
gun shot,
jackhammer,
siren, and
street music
Here’s a sound excerpt from the dataset. Can you guess which class it belongs to?
To play this in the jupyter notebook, you can simply follow along with the code.
ipd.Audio('../data/Train/2022.wav')
Now let us load this audio into our notebook as a numpy array. For this, we will use the
librosa library in Python. To install librosa, just type this in the command line:
pip install librosa
When you load the data, it gives you two objects: a numpy array of the audio file and the
corresponding sampling rate at which it was extracted. Now, to represent this as a
waveform (which it originally is), use the following code:
%pylab inline
import os
import pandas as pd
import librosa
import librosa.display
import glob

data, sampling_rate = librosa.load('../data/Train/2022.wav')

plt.figure(figsize=(12, 4))
librosa.display.waveplot(data, sr=sampling_rate)
Let us now visually inspect our data and see if we can find patterns in it.
Class: jackhammer
Class: drilling
Class: dog_barking
We can see that it may be difficult to differentiate between jackhammer and drilling, but it
is still easy to discern between dog_barking and drilling. To see more such examples,
you can use this code
import random

i = random.choice(train.index)   # train is the DataFrame loaded from the training CSV
audio_name = train.ID[i]
print('Class: ' + str(train.Class[i]))
x, sr = librosa.load('../data/Train/' + str(audio_name) + '.wav')
plt.figure(figsize=(12, 4))
librosa.display.waveplot(x, sr=sr)
train.Class.value_counts(normalize=True)
Out[10]:
jackhammer 0.122907
engine_idling 0.114811
siren 0.111684
dog_bark 0.110396
air_conditioner 0.110396
children_playing 0.110396
street_music 0.110396
drilling 0.110396
car_horn 0.056302
gun_shot 0.042318
We see that the jackhammer class has more values than any other class, so let us create
our first submission with this idea.
test = pd.read_csv('../data/test.csv')
test['Class'] = 'jackhammer'
test.to_csv('sub01.csv', index=False)
This seems like a good idea as a benchmark for any challenge, but for this problem it is
a bit unfair, because the dataset is not heavily imbalanced.
def parser(row):
    # load the clip and use its mean MFCC vector (40 coefficients) as the feature
    file_name = os.path.join('../data/Train', str(row.ID) + '.wav')
    try:
        X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    except Exception:
        return [None, None]
    feature = mfccs
    label = row.Class
    return [feature, label]
Step 3: Convert the data so it can be passed to our deep learning model
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

# apply the parser to every row: one [feature, label] pair per clip
temp = pd.DataFrame(train.apply(parser, axis=1).tolist(), columns=['feature', 'label'])
X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())
lb = LabelEncoder()
y = np_utils.to_categorical(lb.fit_transform(y))
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

num_labels = y.shape[1]
filter_size = 2

# build model
model = Sequential()
model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_labels))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
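The training log below was presumably produced by a fit call along these lines; the batch size and the validation split are assumptions, since the original call is not shown:
# assumed training call: 10 epochs with a held-out validation split
model.fit(X, y, batch_size=32, epochs=10, validation_split=0.2)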
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
This seems okay, but the score can obviously be improved. (PS: I could get an accuracy of
80% on my validation dataset.) Now it’s your turn: can you improve on this score? If you
do, let me know in the comments below!
End Notes
In this article, I have given a brief overview of audio processing with a case study on
the Urban Sound challenge. I have also shown the steps you perform when dealing with
audio data in Python with the librosa package. With this “shastra” in your hand, I hope you
can try your own algorithms on the Urban Sound challenge, or try solving your own audio
problems in daily life.