Chapter 8. Classification: Basic Concepts
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class
◼ The model is represented in a form such as mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model
◼ Note: If the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction
Data (name, rank, years, tenured):
Tom      Assistant Prof  2  no
Merlisa  Associate Prof  7  no
George   Professor       5  yes
Joseph   Assistant Prof  7  yes
Unseen data: (Jeff, Professor, 4)  Tenured?
Chapter 8. Classification: Basic Concepts
◼ Attribute Ranking
◼ Information gain
◼ Gain ratio
◼ Gini-index
◼ Splitting Criterion
◼ Decision Tree Pruning
◼ Top-down: Pre-pruning
Decision Tree Induction: An Example
❑ Training data set: Buys_computer
❑ The data set follows an example of Quinlan’s ID3 (Playing Tennis)
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
❑ Resulting tree:
[Figure: decision tree with root test "age?" and branches <=30, 31…40, >40]
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer manner
◼ At start, all the training examples are at the root
◼ Attributes are categorical (continuous-valued attributes are discretized in advance)
◼ Examples are partitioned recursively based on selected attributes
◼ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Select the attribute with the highest information gain
◼ Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|
◼ Expected information (entropy) needed to classify a tuple in D, where m is the number of classes:
$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
◼ Information needed (after using A to split D into v partitions) to classify D:
$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
◼ Information gained by branching on attribute A:
$Gain(A) = Info(D) - Info_A(D)$
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
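The gain values above can be checked with a short script. What follows is a minimal sketch, not the slides' own implementation: it assumes the 14-tuple buys_computer training set from the decision tree example, and the names DATA, ATTRIBUTES, info, info_after_split and gain are illustrative choices.

```python
from collections import Counter
from math import log2

# The 14 buys_computer training tuples:
# (age, income, student, credit_rating, buys_computer)
ATTRIBUTES = ["age", "income", "student", "credit_rating"]
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def info(labels):
    """Entropy Info(D) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_after_split(rows, attr_index):
    """Info_A(D): size-weighted entropy of the partitions induced by one attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    return sum(len(p) / len(rows) * info(p) for p in partitions.values())


def gain(rows, attr_index):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([row[-1] for row in rows]) - info_after_split(rows, attr_index)


gains = {name: gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)  # attribute chosen for the root split
```

On this data the computed gains for income, student and credit_rating agree with the listed values to within rounding, and age has the highest gain, so it becomes the root test.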
Computing Information-Gain for
Continuous-Valued Attributes
◼ Let attribute A be a continuous-valued attribute
◼ Must determine the best split point for A
◼ Sort the values of A in increasing order
◼ Typically, the midpoint between each pair of adjacent values is considered as a possible split point
◼ (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
◼ The point with the minimum expected information
requirement for A is selected as the split-point for A
◼ Split:
◼ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
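The split-point search described above can be sketched as follows. This is an illustrative sketch only: the function names and the toy age values are assumptions, and candidate midpoints are taken between adjacent distinct values.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_split_point(values, labels):
    """Return (split_point, expected_info) with minimum Info_A(D), or None if no split exists."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        split = (low + high) / 2                      # midpoint of adjacent values
        d1 = [y for x, y in pairs if x <= split]      # tuples with A <= split-point
        d2 = [y for x, y in pairs if x > split]       # tuples with A > split-point
        expected = (len(d1) * info(d1) + len(d2) * info(d2)) / len(pairs)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best


# Hypothetical ages and class labels, used only to exercise the function.
ages = [20, 25, 30, 40, 50, 60]
buys = ["no", "no", "no", "yes", "yes", "yes"]
split, expected = best_split_point(ages, buys)
```

Here the midpoint between 30 and 40 separates the two classes perfectly, so it has the minimum expected information requirement.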
Gain Ratio for Attribute Selection (C4.5)
◼ Information gain measure is biased towards attributes with a
large number of values
◼ C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
◼ GainRatio(A) = Gain(A)/SplitInfo(A)
◼ Ex.
◼ D = {D1={1,2,3}, D2={4,5,6}}
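A small sketch of the gain ratio computation follows. The function names are illustrative assumptions; the partition sizes 4, 6, 4 are the counts of high, medium and low income in the buys_computer training set, and 0.029 is the Gain(income) value given earlier.

```python
from math import log2


def split_info(partition_sizes):
    """SplitInfo_A(D) computed from the sizes |D_j| of the partitions."""
    total = sum(partition_sizes)
    return -sum((n / total) * log2(n / total) for n in partition_sizes if n > 0)


def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    si = split_info(partition_sizes)
    if si == 0:
        raise ValueError("SplitInfo is zero: the attribute does not split D")
    return gain / si


two_way = split_info([3, 3])                 # D1 = {1,2,3}, D2 = {4,5,6}
income_ratio = gain_ratio(0.029, [4, 6, 4])  # income splits D into 4, 6 and 4 tuples
```

An even two-way split has SplitInfo equal to 1 bit, so in that case the gain ratio equals the gain; splits into many small partitions have a larger SplitInfo and are penalized.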
Other Attribute Selection Measures
◼ CHAID: a popular decision tree algorithm, measure based on χ2 test for independence
◼ C-SEP: performs better than info. gain and gini index in certain cases
◼ G-statistic: has a close approximation to χ2 distribution
◼ MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
◼ The best tree is the one that requires the fewest bits to both (1) encode the tree, and (2) encode the exceptions to the tree
◼ Multivariate splits (partition based on multiple variable combinations)
◼ CART: finds multivariate splits based on a linear combination of attributes
◼ Which attribute selection measure is the best?
◼ Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
◼ Overfitting: An induced tree may overfit the training data
◼ Too many branches, some of which may reflect anomalies due to noise or outliers
◼ Poor accuracy for unseen samples (extreme case: roll-id as the top split attribute)
Practice Quiz
◼ Which measures of attribute ranking are biased towards attributes having a large number of unique values?
◼ What strategy of attribute ranking may lead to
Enhancements to Basic Decision Tree Induction
Scalability Framework for RainForest
Rainforest: Training Set and Its AVC Sets
Presentation of Classification Results
Prediction Based on Bayes’ Theorem
◼ Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$
Classification Is to Derive the Maximum Posteriori
◼ Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm.
◼ Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem:
$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$
◼ Since P(X) is constant for all classes, only
$P(X \mid C_i)\,P(C_i)$
needs to be maximized
Naïve Bayes Classifier
◼ A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)$
◼ This greatly reduces the computation cost: only the class distribution needs to be counted
◼ If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k for A_k divided by |C_{i,D}| (# of tuples of C_i in D)
◼ If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with a mean μ and standard deviation σ:
$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
and P(x_k|C_i) is
$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
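For the continuous-valued case, the Gaussian estimate can be sketched as below. The function names are illustrative, and estimating σ with the sample standard deviation (dividing by n - 1) is an assumption, since the slides do not say which estimator is used.

```python
from math import exp, pi, sqrt


def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)


def class_conditional(x_k, class_values):
    """Estimate P(x_k | C_i) from the values of attribute A_k observed in class C_i.

    Needs at least two values that are not all equal, so that sigma > 0.
    """
    n = len(class_values)
    mu = sum(class_values) / n
    sigma = sqrt(sum((v - mu) ** 2 for v in class_values) / (n - 1))
    return gaussian(x_k, mu, sigma)


# Hypothetical ages of three tuples in one class: mean 35, standard deviation 5.
likelihood = class_conditional(35, [30, 35, 40])
```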
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data to be classified:
X = (age <=30, income = medium, student = yes, credit_rating = fair)
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
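Classifying the tuple X from this training set can be sketched as follows. This is a minimal illustration using plain relative-frequency estimates with no smoothing; the names DATA, naive_bayes_scores and prediction are assumptions.

```python
from collections import Counter

# The 14 training tuples: (age, income, student, credit_rating, buys_computer)
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def naive_bayes_scores(rows, x):
    """Return {class: P(X|C_i) * P(C_i)} using relative-frequency estimates."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(C_i)
        for k, value in enumerate(x):
            matches = sum(1 for row in rows if row[-1] == label and row[k] == value)
            score *= matches / count  # P(x_k | C_i)
        scores[label] = score
    return scores


X = ("<=30", "medium", "yes", "fair")
scores = naive_bayes_scores(DATA, X)
prediction = max(scores, key=scores.get)
```

For this X the score for buys_computer = yes is about 0.028 and the score for buys_computer = no is about 0.007, so X is assigned to class yes.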
Naïve Bayes Classifier: An Example
“uncorrected” counterparts
Naïve Bayes Classifier: Comments
◼ Advantages
◼ Easy to implement
◼ Disadvantages
◼ Assumption: class conditional independence, which can cause a loss of accuracy
◼ Practically, dependencies exist among variables
◼ Such dependencies cannot be modeled by a naïve Bayes classifier
◼ How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
Chapter 8. Classification: Basic Concepts
◼ Components
◼ X1, X2: Numerical Input
◼ f is non-linear and is called the activation function; it takes a single number and performs a fixed mathematical operation on it
◼ 1: Bias input with weight b
Activation Function
[Ref- Engelbrecht Andries P., Computational Intelligence: An Introduction, Wiley]
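The single-neuron computation described by these components can be sketched as below. The input weights w1 and w2 and the choice of a sigmoid for f are assumptions made for illustration; the slide only names the inputs X1, X2, the bias weight b and a non-linear activation f.

```python
from math import exp


def sigmoid(z):
    """One common non-linear activation function (assumed here)."""
    return 1.0 / (1.0 + exp(-z))


def neuron_output(x1, x2, w1, w2, b, f=sigmoid):
    """Apply the activation f to the weighted sum of the inputs plus the bias term.

    The bias is modeled as a constant input 1 with weight b.
    """
    return f(w1 * x1 + w2 * x2 + b * 1)


y = neuron_output(1.0, 1.0, w1=2.0, w2=-2.0, b=0.0)  # weighted sum is 0
```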
[Figure: worked example with binary input patterns and target values t, Iteration 1 and Iteration 2]
Layers of Neurons: (e.g. in Feedforward NN)
◼ Input nodes
◼ No computation is performed in any of the Input nodes
◼ They just pass on the information to the hidden nodes
◼ Hidden nodes
◼ They perform computations and transfer information from the input
nodes to the output nodes.
◼ There can be zero or multiple hidden layers
◼ Output nodes
◼ Responsible for computations and transferring information from the
network to the outside world
◼ One output node for one decision parameter
MLP Back-Propagation (Feedforward) Network
Chapter 8. Classification: Basic Concepts
[Figure: examples covered by Rule 1, Rule 2, and Rule 3]
Rule Generation
◼ To generate a rule
while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to current rule
    else break
[Figure: positive examples and negative examples]
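The rule-growing loop above can be sketched in Python as follows. The slide does not define foil-gain, so the sketch assumes the usual FOIL definition, pos' × (log2(pos'/(pos'+neg')) − log2(pos/(pos+neg))); the data layout and the names foil_gain, covers and grow_rule are illustrative.

```python
from math import log2


def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL gain of extending a rule covering (pos, neg) to one covering (pos_new, neg_new).

    Assumes pos > 0; a predicate that covers no positive example scores 0.
    """
    if pos_new == 0:
        return 0.0
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))


def covers(rule, example):
    """A rule is a list of (attribute, value) predicates that must all hold."""
    return all(example[attr] == value for attr, value in rule)


def grow_rule(examples, candidates, threshold=0.0):
    """Greedily add the predicate with the best FOIL gain while it exceeds the threshold."""
    rule = []
    while True:
        covered = [(e, y) for e, y in examples if covers(rule, e)]
        pos = sum(1 for _, y in covered if y)
        neg = len(covered) - pos
        if pos == 0:
            break
        best, best_gain = None, threshold
        for p in candidates:
            if p in rule:
                continue
            sub = [(e, y) for e, y in covered if covers([p], e)]
            pos_new = sum(1 for _, y in sub if y)
            g = foil_gain(pos, neg, pos_new, len(sub) - pos_new)
            if g > best_gain:
                best, best_gain = p, g
        if best is None:
            break
        rule.append(best)
    return rule


# Hypothetical examples: (attribute values, is_positive)
examples = [
    ({"student": "yes", "credit": "fair"}, True),
    ({"student": "yes", "credit": "excellent"}, True),
    ({"student": "no", "credit": "fair"}, False),
    ({"student": "no", "credit": "excellent"}, False),
]
candidates = [("student", "yes"), ("student", "no"), ("credit", "fair"), ("credit", "excellent")]
rule = grow_rule(examples, candidates)
```

On this toy data the predicate student = yes covers both positives and no negatives, so it is added first; no further predicate has positive gain and the loop stops.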
How to Learn-One-Rule?
◼ Start with the most general rule possible: condition = empty
◼ Adding new attributes by adopting a greedy depth-first strategy
◼ Picks the one that most improves the rule quality
[Figure: search over candidate conditions a, ~a, b, ~b, c, ~c]
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity
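Confusion matrix (A = actual class, P = predicted class):
A\P     C    ¬C   Total
C       TP   FN   P
¬C      FP   TN   N
Total   P’   N’   All
Measures: Accuracy, Error Rate, Sensitivity (Recall), Specificity, Precision
Class Imbalance Problem:
◼ One class may be rare, e.g., fraud, or HIV-positive
◼ Significant majority of the negative class and minority of the positive class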
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
◼ Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
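The evaluation measures named on these slides can be computed from the four confusion-matrix counts. The sketch below uses the standard definitions (accuracy = (TP+TN)/All, sensitivity = TP/P, specificity = TN/N, precision = TP/(TP+FP), F1 as the harmonic mean of precision and recall); the counts themselves are hypothetical and only illustrate a rare positive class.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Compute common classifier evaluation measures from confusion-matrix counts."""
    p = tp + fn          # actual positives
    n = fp + tn          # actual negatives
    total = p + n
    precision = tp / (tp + fp)
    recall = tp / p      # same quantity as sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 300 actual positives among 10,000 tuples.
m = evaluation_metrics(tp=90, fn=210, fp=140, tn=9560)
```

With these counts the accuracy is 96.5% even though only 30% of the positive tuples are found, which is why accuracy alone can be misleading when one class is rare.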
Classifier Evaluation Metrics: Example
Practice Quiz
◼ Evaluate the relative importance of accuracy, sensitivity, specificity, etc. in the following decision-making cases
◼ Attacking a foreign aircraft by automated surveillance in case of positive
◼ Detecting Corona-positive, Corona-negative if immediate
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
◼ Holdout method
◼ Given data is randomly partitioned into two independent sets
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
◼ Suppose we have 2 classifiers, M1 and M2, which one is better?
◼ These mean error rates are just estimates of error on the true
population of future data cases
Estimating Confidence Intervals:
Null Hypothesis
◼ Perform 10-fold cross-validation
◼ Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
◼ Use t-test (or Student’s t-test)
◼ Null Hypothesis: M1 & M2 are the same
◼ If we can reject null hypothesis, then
◼ we conclude that the difference between M1 & M2 is
statistically significant
◼ Choose the model with the lower error rate
Estimating Confidence Intervals: t-test
where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
Estimating Confidence Intervals:
Table for t-distribution
◼ Symmetric
◼ Significance level, e.g., sig = 0.05 or 5%, means M1 & M2 are significantly different for 95% of the population
◼ Confidence limit, z = sig/2
Estimating Confidence Intervals:
Statistical Significance
◼ Are M1 & M2 significantly different?
◼ Compute t. Select significance level (e.g. sig = 5%)
◼ If t > t(z) or t < -t(z), then the t value lies in the rejection region:
◼ Reject the null hypothesis that M1 & M2 are the same
◼ Conclude: statistically significant difference between M1 & M2
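The procedure can be sketched for the pairwise case, where M1 and M2 are evaluated on the same 10 cross-validation folds. This is a sketch under assumptions: it uses the standard paired t statistic with the sample standard deviation, the error rates are hypothetical, and 2.262 is the tabulated two-sided critical value for sig = 5% with 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided critical value of Student's t for sig = 5% (z = 0.025), k - 1 = 9 degrees of freedom.
T_CRITICAL_DF9 = 2.262


def paired_t_statistic(errors_m1, errors_m2):
    """t statistic for the per-fold differences in error rate (needs non-identical differences)."""
    diffs = [a - b for a, b in zip(errors_m1, errors_m2)]
    k = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(k))


# Hypothetical error rates of M1 and M2 on the same 10 folds.
err_m1 = [0.10, 0.12, 0.11, 0.13, 0.09, 0.10, 0.12, 0.11, 0.10, 0.12]
err_m2 = [0.15, 0.16, 0.14, 0.17, 0.15, 0.13, 0.16, 0.15, 0.14, 0.16]

t = paired_t_statistic(err_m1, err_m2)
reject_null = t > T_CRITICAL_DF9 or t < -T_CRITICAL_DF9
```

For these made-up error rates t falls far inside the rejection region, so the null hypothesis is rejected and M1, which has the lower mean error rate, would be chosen.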
Model Selection: ROC Curves
◼ ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
◼ Originated from signal detection theory
◼ Shows the trade-off between the true positive rate and the false positive rate
◼ The area under the ROC curve is a measure of the accuracy of the model
◼ Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
◼ The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
◼ Vertical axis represents the true positive rate
◼ Horizontal axis represents the false positive rate
◼ The plot also shows a diagonal line
◼ A model with perfect accuracy will have an area of 1.0
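Building the ROC curve from a ranked list and measuring the area under it can be sketched as follows. The function names and the example scores are assumptions; the sketch also assumes distinct scores and at least one tuple of each class.

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate).

    Tuples are ranked by decreasing score; each step down the list adds one point.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, is_positive in ranked if is_positive)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical classifier scores; True marks an actual positive tuple.
auc_perfect = area_under_curve(roc_points([0.9, 0.8, 0.3, 0.2], [True, True, False, False]))
auc_mixed = area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False]))
```

A ranking that places every positive above every negative gives an area of 1.0, while the mixed ranking above gives 0.75.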
Issues Affecting Model Selection
◼ Accuracy
◼ classifier accuracy: predicting class label
◼ Speed
◼ time to construct the model (training time)
◼ time to use the model (classification/prediction time)
◼ Robustness: handling noise and missing values
◼ Scalability: efficiency in disk-resident databases
◼ Interpretability
◼ understanding and insight provided by the model
◼ Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Chapter 8. Classification: Basic Concepts
◼ Ensemble methods
◼ Use a combination of models to increase accuracy
◼ Boosting: weighted vote with a collection of classifiers
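Bagging: Bootstrap Aggregation
◼ Analogy: Diagnosis based on multiple doctors’ majority vote
◼ Training
◼ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
◼ A classifier model Mi is learned for each training set Di
◼ Classification: to classify an unknown sample X, each classifier Mi returns its class prediction
◼ The bagged classifier M* counts the votes and assigns the class with the most votes to X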
◼ Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
◼ Accuracy
◼ Often significantly better than a single classifier derived from D
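Bagging for classification can be sketched as below. The base learner (a nearest-class-mean classifier on one numeric attribute), the toy data and all function names are hypothetical stand-ins chosen to keep the example short; any learner that returns a prediction function would fit the same skeleton.

```python
import random
from collections import Counter


def bagging_fit(data, learn, k, rng):
    """Learn k models, each from a bootstrap sample of len(data) tuples drawn with replacement."""
    d = len(data)
    return [learn([rng.choice(data) for _ in range(d)]) for _ in range(k)]


def bagging_predict(models, x):
    """The bagged classifier M*: return the class with the most votes."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def learn_nearest_mean(sample):
    """Hypothetical base learner: classify a number by the nearest class mean."""
    by_class = {}
    for value, label in sample:
        by_class.setdefault(label, []).append(value)
    means = {label: sum(values) / len(values) for label, values in by_class.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


# Toy training set: values 0..4 belong to class "a", values 10..14 to class "b".
D = [(v, "a") for v in range(5)] + [(v, "b") for v in range(10, 15)]
models = bagging_fit(D, learn_nearest_mean, k=5, rng=random.Random(0))
```

For numeric prediction the vote in bagging_predict would be replaced by the average of the individual predictions.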
Boosting
◼ Analogy: Consult several doctors, based on a combination of weighted diagnoses, with weights assigned based on the previous diagnosis accuracy
◼ How does boosting work?
◼ Weights are assigned to each training tuple
◼ A series of k classifiers is iteratively learned
◼ After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
◼ The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
◼ Boosting algorithm can be extended for numeric prediction
◼ Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data
Practice Quiz
◼ What will be the output of a boosting-based classifier if the four models’ outputs are 3, 5, 4, 4, respectively? Consider the respective accuracy values as 5, 5, 4, 8.
◼ {respective weights – 5/8, 5/8, 4/8, 8/8 =}
Adaboost (Freund and Schapire, 1997)
◼ Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
◼ Initially, all the weights of tuples are set the same (1/d)
◼ Generate k classifiers in k rounds. At round i,
◼ Tuples from D are sampled (with replacement) to form a training set
Di of the same size
◼ Each tuple’s chance of being selected is based on its weight
◼ A classification model Mi is derived from Di
◼ Its error rate is calculated using Di as a test set
◼ If a tuple is misclassified, its weight is increased; otherwise it is decreased
◼ Error rate: err(X_j) is the misclassification error of tuple X_j. Classifier M_i’s error rate is the sum of the weights of the misclassified tuples:
$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$
◼ The weight of classifier M_i’s vote is
$\log \frac{1 - error(M_i)}{error(M_i)}$
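One round of the weight bookkeeping can be sketched as follows. The error rate and vote weight follow the formulas above; the update rule (multiply the weights of correctly classified tuples by error/(1 − error), then normalize) is one commonly described choice and is an assumption here, as is the natural logarithm. The function names are illustrative.

```python
from math import log


def weighted_error(weights, misclassified):
    """error(M_i): sum of the weights of the misclassified tuples."""
    return sum(w for w, wrong in zip(weights, misclassified) if wrong)


def vote_weight(error):
    """Weight of the classifier's vote; requires 0 < error < 1."""
    return log((1 - error) / error)


def update_weights(weights, misclassified, error):
    """Shrink the weights of correctly classified tuples, then normalize to sum to 1.

    After normalization the misclassified tuples carry relatively more weight.
    """
    factor = error / (1 - error)
    scaled = [w if wrong else w * factor for w, wrong in zip(weights, misclassified)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Four tuples with equal initial weights 1/d; the first one is misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [True, False, False, False]
error = weighted_error(weights, misclassified)
alpha = vote_weight(error)
new_weights = update_weights(weights, misclassified, error)
```

After the update the single misclassified tuple holds half of the total weight, so it is much more likely to be sampled into the next training set.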
Random Forest (Breiman 2001)
◼ Random Forest:
◼ Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of attributes at each node to determine the split
◼ Two Methods to construct Random Forest:
◼ Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size
◼ Forest-RC (random linear combinations): Creates new attributes (or features) that are linear combinations of the existing attributes
Summary (II)
◼ Significance tests and ROC curves are useful for model selection.
◼ There have been numerous comparisons of the different
classification methods; the matter remains a research topic
◼ No single method has been found to be superior over all others
for all data sets
◼ Issues such as accuracy, training time, robustness, scalability, and interpretability must be considered and can involve trade-offs, further complicating the quest for an overall superior method
