Tutorial 6
1 Module 6: Classification
The following tutorial contains Python examples for solving classification problems. You should refer to Chapters 3 and 4 of the “Introduction to Data Mining” book to understand some of the concepts introduced in this tutorial.
Classification is the task of predicting a nominal-valued attribute (known as the class label) based on the values of other attributes (known as predictor variables). The goals of this tutorial are as follows:
1. To provide examples of using different classification techniques from the scikit-learn library package.
2. To demonstrate the problem of model overfitting.
Read the step-by-step instructions below carefully. To execute the code, click on the corresponding
cell and press the SHIFT-ENTER keys simultaneously.
[ ]: import pandas as pd
data = pd.read_csv('vertebrate.csv',header='infer')
data
Given the limited number of training examples, suppose we convert the problem into a binary classification task (mammals versus non-mammals). We can do so by replacing the class labels of all instances that do not belong to the mammals class with non-mammals.
[ ]: data['Class'] = data['Class'].replace(['fishes','birds','amphibians','reptiles'],'non-mammals')
data
We can apply Pandas cross-tabulation to examine the relationship between the Warm-blooded and
Gives Birth attributes with respect to the class.
[ ]: pd.crosstab([data['Warm-blooded'],data['Gives Birth']],data['Class'])
The results above show that it is possible to distinguish mammals from non-mammals using these
two attributes alone since each combination of their attribute values would yield only instances that
belong to the same class. For example, mammals can be identified as warm-blooded vertebrates
that give birth to their young. Such a relationship can also be derived using a decision tree classifier,
as shown by the example given in the next subsection.
[ ]: from sklearn import tree

Y = data['Class']
X = data.drop(['Name','Class'],axis=1)

clf = tree.DecisionTreeClassifier(criterion='entropy',max_depth=3)
clf = clf.fit(X, Y)
The preceding commands extract the predictor (X) and target class (Y) attributes from the vertebrate dataset and create a decision tree classifier object using entropy as the impurity measure for the splitting criterion. The decision tree class in the Python sklearn library also supports using 'gini' as the impurity measure. The classifier above is also constrained to generate trees with a maximum depth equal to 3. Next, the classifier is trained on the labeled data using the fit() function.
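As a side note, the 'gini' variant is not shown in the original tutorial; a minimal sketch, assuming the same X and Y as above, would simply change the criterion argument (the rest of the tutorial continues to use the entropy-based classifier clf):

[ ]: # Illustrative only: same classifier as above but using the 'gini' impurity measure
clf_gini = tree.DecisionTreeClassifier(criterion='gini', max_depth=3)
clf_gini = clf_gini.fit(X, Y)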
We can plot the resulting decision tree obtained after training the classifier. To do this, you must
first install both graphviz (http://www.graphviz.org) and its Python interface called pydotplus
(http://pydotplus.readthedocs.io/).
[ ]: import pydotplus
from IPython.display import Image

# Reconstructed export call; the feature/class name arguments are assumed from context
dot_data = tree.export_graphviz(clf, feature_names=X.columns,
                                class_names=['mammals','non-mammals'],
                                filled=True, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Next, suppose we apply the decision tree to classify a set of previously unseen test examples (loaded into a DataFrame named testData with the same columns as the training data). We first extract the predictor and target class attributes from the test data and then apply the decision tree classifier to predict their classes.
[ ]: testY = testData['Class']
testX = testData.drop(['Name','Class'],axis=1)
predY = clf.predict(testX)
predictions = pd.concat([testData['Name'],pd.Series(predY,name='Predicted Class')], axis=1)
predictions
Except for platypus, which is an egg-laying mammal, the classifier correctly predicts the class label
of the test examples. We can calculate the accuracy of the classifier on the test data as shown by
the example given below.
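The accuracy-computation cell itself is not reproduced in this excerpt; a minimal sketch using sklearn's accuracy_score (assuming testY and predY from the previous cell) would look like the following:

[ ]: from sklearn.metrics import accuracy_score

# Fraction of test examples whose predicted class matches the true class
print('Accuracy on test data is %.2f' % accuracy_score(testY, predY))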
To illustrate the problem of model overfitting, we next generate a two-dimensional synthetic dataset with 1500 labeled instances: half of the instances are drawn from a mixture of three Gaussian distributions (class 1), and the remaining half are drawn uniformly at random from a square (class 0). The Gaussian means and covariance below are illustrative choices, since the original values are not shown in this excerpt.

[ ]: import numpy as np
import matplotlib.pyplot as plt
from numpy.random import random
%matplotlib inline

N = 1500

# Illustrative cluster centres and shared covariance for the three Gaussians
mean1 = [6, 14]
mean2 = [10, 6]
mean3 = [14, 14]
cov = [[3.5, 0], [0, 3.5]]

np.random.seed(50)
X = np.random.multivariate_normal(mean1, cov, int(N/6))
X = np.concatenate((X, np.random.multivariate_normal(mean2, cov, int(N/6))))
X = np.concatenate((X, np.random.multivariate_normal(mean3, cov, int(N/6))))
X = np.concatenate((X, 20*np.random.rand(int(N/2),2)))
Y = np.concatenate((np.ones(int(N/2)),np.zeros(int(N/2))))
plt.plot(X[:int(N/2),0],X[:int(N/2),1],'r+',X[int(N/2):,0],X[int(N/2):,1],'k.',ms=4)
In this example, we reserve 80% of the labeled data for training and the remaining 20% for testing.
We then fit decision trees of different maximum depths (from 2 to 50) to the training set and plot
their respective accuracies when applied to the training and test sets.
[ ]: #########################################
# Training and Test set creation
#########################################
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score

# Hold out 20% of the labeled data for testing (random_state is an illustrative choice)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

#########################################
# Model fitting and evaluation
#########################################
maxdepths = [2,3,4,5,6,7,8,9,10,15,20,25,30,35,40,45,50]
trainAcc = np.zeros(len(maxdepths))
testAcc = np.zeros(len(maxdepths))
index = 0
for depth in maxdepths:
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    clf = clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    trainAcc[index] = accuracy_score(Y_train, Y_predTrain)
    testAcc[index] = accuracy_score(Y_test, Y_predTest)
    index += 1
#########################################
# Plot of training and test accuracies
#########################################
plt.plot(maxdepths,trainAcc,'ro-',maxdepths,testAcc,'bv--')
plt.legend(['Training Accuracy','Test Accuracy'])
plt.xlabel('Max depth')
plt.ylabel('Accuracy')
The plot above shows that training accuracy will continue to improve as the maximum depth of
the tree increases (i.e., as the model becomes more complex). However, the test accuracy initially
improves up to a maximum depth of 5, before it gradually decreases due to model overfitting.
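As a quick sanity check (not part of the original tutorial), the depth at which the test accuracy peaks can be read off the arrays computed above:

[ ]: # Maximum depth with the highest test accuracy (illustrative check)
print('Best max_depth:', maxdepths[np.argmax(testAcc)])
print('Corresponding test accuracy: %.3f' % np.max(testAcc))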
For the k-nearest neighbor classifier, the class label of a test instance is determined by the majority class among its k closest training examples, where closeness is measured using the Minkowski distance:

$$\text{Minkowski distance}(x, y) = \Bigl[\sum_{i=1}^{N} |x_i - y_i|^p\Bigr]^{1/p}$$

With p = 2 this reduces to the Euclidean distance, which is the setting used in the code below.
[ ]: from sklearn.neighbors import KNeighborsClassifier

# Candidate values of k (illustrative grid) and accuracy containers
numNeighbors = [1, 5, 10, 15, 20, 25, 30]
trainAcc = []
testAcc = []

for k in numNeighbors:
    clf = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    trainAcc.append(accuracy_score(Y_train, Y_predTrain))
    testAcc.append(accuracy_score(Y_test, Y_predTest))
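The corresponding plotting cell is not reproduced in this excerpt; a minimal sketch of how the accuracies could be visualised against k, following the same plotting style as the decision tree example, is given below:

[ ]: plt.plot(numNeighbors, trainAcc, 'ro-', numNeighbors, testAcc, 'bv--')
plt.legend(['Training Accuracy','Test Accuracy'])
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')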
For logistic regression, the model can be described by the following equation:
$$P(y = 1 \mid x) = \frac{1}{1 + \exp(-w^T x - b)} = \sigma(w^T x + b)$$
The model parameters (w, b) are estimated by optimizing the following regularized negative log-likelihood function:

$$(w^*, b^*) = \arg\min_{w,b} \; -\sum_{i=1}^{N} \Bigl[ y_i \log \sigma(w^T x_i + b) + (1 - y_i) \log \sigma(-w^T x_i - b) \Bigr] + \frac{1}{C}\, \Omega([w, b])$$
where C is a hyperparameter that controls the inverse of model complexity (smaller values imply stronger regularization), while Ω(·) is the regularization term, which by default is an $\ell_2$-norm in sklearn.
For the support vector machine, the model parameters (w*, b*) are estimated by solving the following constrained optimization problem:

$$\min_{w^*,\, b^*,\, \{\xi_i\}} \; \frac{\|w\|^2}{2} + \frac{1}{C} \sum_i \xi_i$$

$$\text{s.t.} \quad \forall i: \; y_i \bigl[ w^T \phi(x_i) + b \bigr] \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
[ ]: from sklearn import linear_model
from sklearn.svm import SVC

# Illustrative grid of C values and accuracy containers
C = [0.01, 0.1, 0.2, 0.5, 0.8, 1, 5, 10, 20, 50]
LRtrainAcc = []
LRtestAcc = []
SVMtrainAcc = []
SVMtestAcc = []

for param in C:
    clf = linear_model.LogisticRegression(C=param)
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    LRtrainAcc.append(accuracy_score(Y_train, Y_predTrain))
    LRtestAcc.append(accuracy_score(Y_test, Y_predTest))

    clf = SVC(C=param,kernel='linear')
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    SVMtrainAcc.append(accuracy_score(Y_train, Y_predTrain))
    SVMtestAcc.append(accuracy_score(Y_test, Y_predTest))
# Plot the logistic regression accuracies against C
# (the original figure-creation and plot calls are not shown; this is a minimal reconstruction)
fig, ax1 = plt.subplots()
ax1.plot(C, LRtrainAcc, 'ro-', C, LRtestAcc, 'bv--')
ax1.legend(['Training Accuracy','Test Accuracy'])
ax1.set_xlabel('C')
ax1.set_xscale('log')
ax1.set_ylabel('Accuracy')
Note that linear classifiers perform poorly on the data since the true decision boundaries between
classes are nonlinear for the given 2-dimensional dataset.
[ ]: # Reset the accuracy containers before fitting the nonlinear (RBF-kernel) SVM
SVMtrainAcc = []
SVMtestAcc = []

for param in C:
    clf = SVC(C=param,kernel='rbf',gamma='auto')
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    SVMtrainAcc.append(accuracy_score(Y_train, Y_predTrain))
    SVMtestAcc.append(accuracy_score(Y_test, Y_predTest))
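The plotting cell for the nonlinear SVM is not reproduced here; its accuracies could be visualised against C in the same way as before, for example:

[ ]: plt.plot(C, SVMtrainAcc, 'ro-', C, SVMtestAcc, 'bv--')
plt.legend(['Training Accuracy','Test Accuracy'])
plt.xlabel('C')
plt.xscale('log')
plt.ylabel('Accuracy')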
Observe that the nonlinear SVM can achieve a higher test accuracy compared to the linear SVM. To conclude, the example below compares three ensemble methods, each combining 500 decision tree base classifiers: a random forest, a bagging ensemble, and an AdaBoost ensemble.
[ ]: from sklearn import ensemble
from sklearn.tree import DecisionTreeClassifier
numBaseClassifiers = 500
maxdepth = 10
trainAcc = []
testAcc = []
clf = ensemble.RandomForestClassifier(n_estimators=numBaseClassifiers)
clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
trainAcc.append(accuracy_score(Y_train, Y_predTrain))
testAcc.append(accuracy_score(Y_test, Y_predTest))
# Bagging ensemble of depth-limited decision trees
clf = ensemble.BaggingClassifier(DecisionTreeClassifier(max_depth=maxdepth), n_estimators=numBaseClassifiers)
clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
trainAcc.append(accuracy_score(Y_train, Y_predTrain))
testAcc.append(accuracy_score(Y_test, Y_predTest))
# AdaBoost ensemble of depth-limited decision trees
clf = ensemble.AdaBoostClassifier(DecisionTreeClassifier(max_depth=maxdepth), n_estimators=numBaseClassifiers)
clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
trainAcc.append(accuracy_score(Y_train, Y_predTrain))
testAcc.append(accuracy_score(Y_test, Y_predTest))
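The original comparison cell is not included in this excerpt; one simple way to inspect the collected accuracies (the three entries correspond to the random forest, bagging, and AdaBoost classifiers, in that order) is:

[ ]: methods = ['Random Forest', 'Bagging', 'AdaBoost']
for method, tr, te in zip(methods, trainAcc, testAcc):
    print('%s: training accuracy = %.3f, test accuracy = %.3f' % (method, tr, te))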