VII - CS8031 - DMDW - Module 6 - Classification - VBP
Chapter 8. Classification: Basic Concepts
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
◼ The model is represented as classification rules, decision trees, or mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model: the known label of each test sample is compared with the model's prediction; accuracy is the percentage of test set samples correctly classified, and the test set must be independent of the training set (otherwise the estimate is overly optimistic)
◼ Note: If the test set is used to select models, it is called a validation (test) set (a minimal end-to-end sketch of the two steps follows)
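A minimal end-to-end sketch of the two steps, assuming scikit-learn is available and using its bundled iris data purely as a stand-in for any labeled dataset:

```python
# Step 1: model construction from tuples with known class labels.
# Step 2: model usage on held-out data to estimate accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out an independent test set so accuracy is not estimated on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model construction
model = DecisionTreeClassifier().fit(X_train, y_train)

# Model usage: classify unseen tuples and estimate accuracy
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```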
Process (1): Model Construction
[Diagram: Training Data → Classification Algorithms → Classifier (model); the classifier is then applied to Testing Data and Unseen Data, e.g. to predict whether (Jeff, Professor, 4) is Tenured]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Chapter 8. Classification: Basic Concepts
◼ Attribute Ranking
◼ Information gain
◼ Gain ratio
◼ Gini-index
◼ Splitting Criterion
◼ Decision Tree Pruning
◼ Top-Down: Pre-pruning
Decision Tree Induction: An Example
❑ Training data set: Buys_computer
❑ The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

❑ Resulting tree: the root splits on age?; the 31…40 branch is a "yes" leaf, the <=30 branch splits further on student (no → no, yes → yes), and the >40 branch splits on credit_rating (excellent → no, fair → yes)
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer manner
◼ At start, all the training examples are at the root
◼ Attributes are categorical (if continuous-valued, they are discretized in advance)
◼ Examples are partitioned recursively based on selected attributes
◼ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain); a minimal sketch of this greedy procedure follows
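A minimal sketch of this greedy, recursive procedure (ID3-style, plain Python, categorical attributes only; the dict-based tuple format is an assumption made purely for illustration):

```python
# Top-down, recursive, divide-and-conquer decision tree induction.
# Tuples are dicts of categorical attributes; information gain is the heuristic.
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    total = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attrs, target):
    classes = [r[target] for r in rows]
    # Stopping: all tuples in one class, or no attributes left -> majority-class leaf
    if len(set(classes)) == 1 or not attrs:
        return Counter(classes).most_common(1)[0][0]
    # Greedy choice: the attribute with the highest information gain
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    node = {"split_on": best, "branches": {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        node["branches"][value] = build_tree(subset, [a for a in attrs if a != best], target)
    return node

# e.g. build_tree(list_of_dicts, ["age", "income", "student", "credit_rating"], "buys_computer")
```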
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Select the attribute with the highest information gain
◼ Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
◼ Expected information (entropy) needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
◼ Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
◼ Information gained by branching on attribute A: Gain(A) = Info(D) - Info_A(D)
◼ Example (the Buys_computer training set above): Info(D) = I(9,5) = 0.940
Gain(age) = 0.246, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
◼ age has the highest information gain, so it is chosen as the splitting attribute (the computation is sketched below)
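A short standard-library sketch that reproduces these numbers on the 14-tuple Buys_computer data; the values in the comments are the ones quoted above:

```python
# Info(D) and the information gain of each attribute for the Buys_computer data.
import math
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31…40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31…40","medium","no","excellent","yes"),
    ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def info(rows):
    counts = Counter(r[-1] for r in rows)   # class label is the last column
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def gain(rows, i):
    info_a = 0.0
    for v in set(r[i] for r in rows):
        subset = [r for r in rows if r[i] == v]
        info_a += len(subset) / len(rows) * info(subset)
    return info(rows) - info_a

print("Info(D) = %.3f" % info(data))                  # 0.940
for i, a in enumerate(attrs):
    print("Gain(%s) = %.3f" % (a, gain(data, i)))     # 0.246, 0.029, 0.151, 0.048
```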
Computing Information-Gain for
Continuous-Valued Attributes
◼ Let attribute A be a continuous-valued attribute
◼ Must determine the best split point for A
◼ Sort the values of A in increasing order
◼ Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
◼ (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
◼ The point with the minimum expected information
requirement for A is selected as the split-point for A
◼ Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point (a minimal split-point search is sketched below)
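A minimal sketch of the midpoint search, assuming a plain list of attribute values with binary class labels (the example column at the end is made up for illustration):

```python
# Sort the values of A, evaluate the midpoint of each adjacent pair, and keep
# the split point with the minimum expected information requirement Info_A(D).
import math
from collections import Counter

def info(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    best_info, best_mid = float("inf"), None
    for i in range(len(pairs) - 1):
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2      # (a_i + a_{i+1}) / 2
        left = [lab for v, lab in pairs if v <= mid]    # D1: tuples with A <= split-point
        right = [lab for v, lab in pairs if v > mid]    # D2: tuples with A > split-point
        if not right:                                   # duplicate values can leave D2 empty
            continue
        info_a = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if info_a < best_info:                          # minimum expected information requirement
            best_info, best_mid = info_a, mid
    return best_mid

# hypothetical continuous 'age' column with binary class labels
print(best_split_point([23, 25, 31, 35, 42, 46], ["no", "no", "yes", "yes", "yes", "no"]))  # 28.0
```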
Gain Ratio for Attribute Selection (C4.5)
◼ Information gain measure is biased towards attributes with a
large number of values
◼ C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
◼ GainRatio(A) = Gain(A)/SplitInfo(A)
◼ Ex. D split into D1 = {1,2,3} and D2 = {4,5,6}: SplitInfo_A(D) = -(3/6)\log_2(3/6) - (3/6)\log_2(3/6) = 1
◼ The attribute with the maximum gain ratio is selected as the splitting attribute (a small sketch follows)
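A small sketch of SplitInfo and GainRatio; the second call assumes, for illustration, the income attribute of the Buys_computer data, whose three partitions have sizes 4, 6 and 4 and whose gain is the 0.029 quoted earlier:

```python
# SplitInfo over partition sizes, and the gain ratio that normalizes the gain.
import math

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum(s / total * math.log2(s / total) for s in partition_sizes if s > 0)

def gain_ratio(gain_a, partition_sizes):
    return gain_a / split_info(partition_sizes)

print(split_info([3, 3]))             # the example above: D1={1,2,3}, D2={4,5,6} -> 1.0
print(gain_ratio(0.029, [4, 6, 4]))   # income: partitions of size 4, 6, 4 -> ~0.019
```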
Other Attribute Selection Measures
◼ CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
◼ C-SEP: performs better than info. gain and gini index in certain cases
◼ G-statistic: has a close approximation to χ2 distribution
◼ MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
◼ The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
◼ Multivariate splits (partition based on multiple variable combinations)
◼ CART: finds multivariate splits based on a linear comb. of attrs.
◼ Which attribute selection measure is the best?
◼ Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
◼ Overfitting: An induced tree may overfit the training data
◼ Too many branches, some may reflect anomalies due to noise or
outliers
◼ Poor accuracy for unseen samples [extreme case: using roll-id as the top splitting attribute fits the training data perfectly but generalizes poorly]; a pruning illustration follows
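An illustration of limiting tree growth, assuming scikit-learn: max_depth behaves like pre-pruning (stop early), while ccp_alpha triggers cost-complexity post-pruning of the fully grown tree; the bundled breast-cancer data is only a stand-in:

```python
# Compare an unpruned tree with pre-pruned and post-pruned trees on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pre = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)        # pre-pruning
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)    # post-pruning

for name, m in [("unpruned", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "train:", m.score(X_tr, y_tr), "test:", m.score(X_te, y_te))
```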
Practice Quiz
◼ Which measure(s) of attribute ranking is/are biased towards attributes having a large number of unique values?
◼ What strategy of attribute ranking may lead to
overfitting?
Enhancements to Basic Decision Tree Induction
Scalability Framework for RainForest
Rainforest: Training Set and Its AVC Sets
Presentation of Classification Results
Prediction Based on Bayes’ Theorem
◼ Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:
P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}
Classification Is to Derive the Maximum Posteriori
◼ Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm.
◼ Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem
P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}
◼ Since P(X) is constant for all classes, only P(X|C_i)\,P(C_i) needs to be maximized
Naïve Bayes Classifier
◼ A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)
◼ This greatly reduces the computation cost: Only counts the
class distribution
◼ If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
◼ If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with a mean μ and standard deviation σ
g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
and P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
Naïve Bayes Classifier: Training Dataset
◼ Class: C1: buys_computer = ‘yes’, C2: buys_computer = ‘no’
◼ Training data: the 14-tuple Buys_computer data set shown earlier (age, income, student, credit_rating, buys_computer)
◼ Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair); a counting sketch follows below
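A from-scratch counting sketch (standard library only) for this query tuple X on the Buys_computer data; the expected scores in the comments follow directly from the class and attribute counts:

```python
# Naive Bayes by counting: P(Ci) and P(xk|Ci) estimated from the training data,
# then P(X|Ci)*P(Ci) is compared across the two classes.
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31…40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31…40","medium","no","excellent","yes"),
    ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
X = ("<=30", "medium", "yes", "fair")

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    prior = n_c / len(data)                              # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(X):                        # P(xk|Ci) by counting
        n_match = sum(1 for row in data if row[-1] == c and row[k] == value)
        likelihood *= n_match / n_c
    scores[c] = prior * likelihood                       # P(X|Ci) * P(Ci)

print(scores)                                            # yes: ~0.028, no: ~0.007
print("Predicted class:", max(scores, key=scores.get))   # buys_computer = yes
```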
Naïve Bayes Classifier: An Example
◼ P(C1) = P(buys_computer = ‘yes’) = 9/14 = 0.643; P(C2) = P(buys_computer = ‘no’) = 5/14 = 0.357
◼ Counting within each class gives, e.g., P(age <= 30 | yes) = 2/9 and P(age <= 30 | no) = 3/5; multiplying the per-attribute probabilities for X yields P(X|yes)P(yes) ≈ 0.028 > P(X|no)P(no) ≈ 0.007, so X is classified as buys_computer = ‘yes’
◼ To avoid the zero-probability problem, use the Laplacian correction (add 1 to each count); the “corrected” probability estimates are close to their “uncorrected” counterparts
Naïve Bayes Classifier: Comments
◼ Advantages
◼ Easy to implement
◼ Disadvantages
◼ Assumption: class conditional independence, therefore loss
of accuracy
◼ Practically, dependencies exist among variables; such dependencies cannot be modeled by a Naïve Bayes Classifier
◼ How to deal with these dependencies? Bayesian Belief Networks (Chapter 9)
Chapter 8. Classification: Basic Concepts
◼ Components
◼ X1, X2: Numerical Input
◼ f is non-linear and is called the Activation Function - takes a single
number and performs a certain fixed mathematical operation on it
◼ 1: bias input, with weight b
Activation Function
[Ref- Engelbrecht Andries P., Computational Intelligence: An Introduction, Wiley]
[Figure: perceptron training example with 3×3 binary input patterns; the first pattern has target t = 1 and the second has target t = −1, with the inputs shown again for Iteration 1 and Iteration 2]
Layers of Neurons: (e.g. in Feedforward NN)
◼ Input nodes
◼ No computation is performed in any of the Input nodes
◼ They just pass on the information to the hidden nodes
◼ Hidden nodes
◼ They perform computations and transfer information from the input
nodes to the output nodes.
◼ There can be zero or multiple hidden layers
◼ Output nodes
◼ Responsible for computations and transferring information from the
network to the outside world
◼ One output node per decision parameter (a minimal forward-pass sketch follows)
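A minimal forward-pass sketch of such a feedforward network, assuming NumPy; the weights are arbitrary illustrative values, not trained ones:

```python
# One forward pass: 2 numerical inputs -> 3 hidden nodes -> 1 output node,
# with a bias term and a sigmoid activation f at each computing node.
import numpy as np

def sigmoid(z):                          # activation function f
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])                # X1, X2: numerical inputs

W_hidden = np.array([[0.4, -0.6],        # weights from the 2 inputs to 3 hidden nodes
                     [0.1,  0.8],
                     [-0.3, 0.2]])
b_hidden = np.array([0.1, -0.2, 0.05])   # bias weights (bias input fixed at 1)

W_out = np.array([0.7, -0.5, 0.9])       # weights from hidden nodes to the output node
b_out = 0.3

hidden = sigmoid(W_hidden @ x + b_hidden)   # hidden nodes compute and pass on
output = sigmoid(W_out @ hidden + b_out)    # one output node per decision parameter
print(output)
```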
MLP Back-Propagation (Feedforward) Network
Chapter 8. Classification: Basic Concepts
[Figure: sequential covering — regions of examples covered by Rule 1, Rule 2, and Rule 3 within the set of positive examples]
Rule Generation
◼ To generate a rule:
    while (true):
        find the best predicate p
        if foil-gain(p) > threshold: add p to the current rule
        else: break
[Figure: general-to-specific rule growth — the rule A3=1 is specialized to A3=1 && A1=2 and then to A3=1 && A1=2 && A8=5, progressively excluding negative examples while still covering positive examples]
How to Learn-One-Rule?
◼ Start with the most general rule possible: condition = empty
◼ Adding new attributes by adopting a greedy depth-first strategy
◼ Picks the one that most improves the rule quality
[Figure: candidate conditions a, ¬a, b, ¬b, c, ¬c considered when growing the rule]
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity
A\P    C     ¬C
C      TP    FN    P
¬C     FP    TN    N
       P’    N’    All

◼ Sensitivity (Recall) = TP/P; Specificity = TN/N; Precision = TP/P’; Accuracy = (TP + TN)/All; Error rate = (FP + FN)/All
◼ Class Imbalance Problem: one class may be rare, e.g. fraud or HIV-positive, and the significant majority of the tuples belong to the negative class
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
◼ Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive: Precision = TP/(TP + FP)
◼ Recall: completeness – what % of positive tuples the classifier labeled as positive: Recall = TP/(TP + FN)
◼ F measure (F1 score): harmonic mean of precision and recall: F = \frac{2 \times precision \times recall}{precision + recall}
(these metrics are computed in the sketch below)
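A small sketch computing all of these metrics from the four confusion-matrix counts; the TP/FN/FP/TN numbers are made up purely for illustration:

```python
# Accuracy, error rate, sensitivity, specificity, precision, recall and F1
# derived from a confusion matrix.
TP, FN, FP, TN = 90, 10, 30, 870

P, N, ALL = TP + FN, FP + TN, TP + FN + FP + TN
accuracy    = (TP + TN) / ALL
error_rate  = (FP + FN) / ALL
sensitivity = TP / P                    # recall, true positive rate
specificity = TN / N                    # true negative rate
precision   = TP / (TP + FP)
recall      = sensitivity
f1          = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} error={error_rate:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f} F1={f1:.3f}")
```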
Classifier Evaluation Metrics: Example
Practice Quiz
◼ Evaluate the relative importance of accuracy, sensitivity, specificity, etc. in the following decision-making cases:
◼ Attacking a foreign aircraft by automated surveillance in case of a positive detection
◼ Declaring a person Corona-positive vs. Corona-negative when an immediate response follows the result
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
◼ Holdout method
◼ Given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation
◼ Cross-validation (k-fold, k = 10 is most popular): randomly partition the data into k mutually exclusive subsets of approximately equal size; at the i-th iteration, use Di as the test set and the remaining subsets as the training set
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
◼ Suppose we have 2 classifiers, M1 and M2, which one is better?
◼ These mean error rates are just estimates of error on the true
population of future data cases
Estimating Confidence Intervals:
Null Hypothesis
◼ Perform 10-fold cross-validation
◼ Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
◼ Use t-test (or Student’s t-test)
◼ Null Hypothesis: M1 & M2 are the same
◼ If we can reject null hypothesis, then
◼ we conclude that the difference between M1 & M2 is
statistically significant
◼ Choose the model with the lower error rate
Estimating Confidence Intervals: t-test
◼ If only one test set is available: perform a pairwise comparison using the same 10-fold cross-partitioning for M1 and M2, and compute (with k−1 degrees of freedom)
t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}
◼ If two test sets are available: use a non-paired t-test with var(M_1 - M_2) = \frac{var(M_1)}{k_1} + \frac{var(M_2)}{k_2}, where k1 & k2 are the # of cross-validation samples used for M1 & M2, resp. (a small paired-test sketch follows)
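A hedged sketch of the pairwise comparison, assuming NumPy and SciPy are available; the per-fold error rates are illustrative numbers, not real results:

```python
# Paired comparison of per-fold error rates for M1 and M2 over the same
# 10-fold cross-partitioning, by hand and via SciPy's paired t-test.
import numpy as np
from scipy import stats

err_m1 = np.array([0.12, 0.10, 0.15, 0.11, 0.13, 0.14, 0.12, 0.10, 0.16, 0.12])
err_m2 = np.array([0.15, 0.14, 0.16, 0.15, 0.17, 0.15, 0.14, 0.13, 0.18, 0.16])

k = len(err_m1)
diff = err_m1 - err_m2
t = diff.mean() / np.sqrt(diff.var(ddof=1) / k)    # t with k-1 degrees of freedom
print("t =", t)

t_scipy, p_value = stats.ttest_rel(err_m1, err_m2)  # same statistic via SciPy
print("t =", t_scipy, "p =", p_value)
# Reject the null hypothesis (M1 and M2 are the same) when p < sig (e.g. 0.05)
```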
Estimating Confidence Intervals:
Table for t-distribution
◼ Symmetric
◼ Significance level,
e.g., sig = 0.05 or
5% means M1 & M2
are significantly
different for 95% of
population
◼ Confidence limit, z
= sig/2
Estimating Confidence Intervals:
Statistical Significance
◼ Are M1 & M2 significantly different?
◼ Compute t and select a significance level (e.g. sig = 5%)
◼ Consult the t-distribution table for the critical value t(z) with k−1 degrees of freedom (here, t(z) = 2.262 for 9 degrees of freedom at sig = 5%)
◼ If t > t(z) or t < −t(z), the t value lies in the rejection region: reject the null hypothesis that the mean error rates of M1 & M2 are the same
◼ Conclude: there is a statistically significant difference between M1 & M2
◼ Otherwise, conclude that any difference between M1 & M2 is due to chance
Model Selection: ROC Curves
◼ ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
◼ Originated from signal detection theory
◼ Shows the trade-off between the true positive rate and the false positive rate: the vertical axis represents the true positive rate, the horizontal axis the false positive rate; the plot also shows a diagonal line
◼ The area under the ROC curve is a measure of the accuracy of the model: a model with perfect accuracy will have an area of 1.0
◼ Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
◼ The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
(see the ROC/AUC sketch below)
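A small ROC/AUC sketch, assuming scikit-learn; the classifier and dataset are stand-ins, chosen only to produce a probability ranking of the test tuples:

```python
# Rank test tuples by the estimated probability of the positive class, then
# compute the ROC curve (FPR vs. TPR) and the area under it.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # horizontal axis: FPR, vertical: TPR
print("AUC =", roc_auc_score(y_te, scores))     # 1.0 = perfect, 0.5 = diagonal line
```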
Issues Affecting Model Selection
◼ Accuracy
◼ classifier accuracy: predicting class label
◼ Speed
◼ time to construct the model (training time)
◼ time to use the model (classification/prediction time)
◼ Robustness: handling noise and missing values
◼ Scalability: efficiency in disk-resident databases
◼ Interpretability
◼ understanding and insight provided by the model
◼ Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Chapter 8. Classification: Basic Concepts
◼ Ensemble methods
◼ Use a combination of models to increase accuracy: combine a series of k learned models, M1, M2, …, Mk, to create an improved composite model M*
◼ Popular ensemble methods:
◼ Bagging: averaging the prediction over a collection of classifiers
◼ Boosting: weighted vote with a collection of classifiers
Bagging: Bootstrap Aggregation
◼ Analogy: Diagnosis based on multiple doctors’ majority vote
◼ Training
◼ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample), and a classifier model Mi is learned from Di
◼ The bagged classifier M* counts the votes and assigns the class with the
most votes to X
◼ Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
◼ Accuracy
◼ Often significantly better than a single classifier derived from D
Boosting
◼ Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
◼ How does boosting work?
◼ Weights are assigned to each training tuple
◼ A series of k classifiers is iteratively learned
◼ After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
◼ The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
◼ Boosting algorithm can be extended for numeric prediction
◼ Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
Practice Quiz
◼ What will be the output of a boosting-based classifier if the four models’ outputs are 3, 5, 4, and 4, respectively? Consider the respective accuracy values to be 5, 5, 4, and 8.
◼ {hint: respective weights = 5/8, 5/8, 4/8, 8/8}
Adaboost (Freund and Schapire, 1997)
◼ Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
◼ Initially, all the weights of tuples are set the same (1/d)
◼ Generate k classifiers in k rounds. At round i,
◼ Tuples from D are sampled (with replacement) to form a training set
Di of the same size
◼ Each tuple’s chance of being selected is based on its weight
◼ A classification model Mi is derived from Di
◼ Its error rate is calculated using Di as a test set
◼ If a tuple is misclassified, its weight is increased, o.w. it is decreased
◼ Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi
error rate is the sum of the weights of the misclassified tuples:
error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)
◼ The weight of classifier Mi’s vote is \log\frac{1 - error(M_i)}{error(M_i)} (a from-scratch sketch of the whole loop follows)
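A from-scratch sketch of this loop, assuming scikit-learn decision stumps as the base classifiers; passing sample_weight directly replaces sampling-by-weight, a common simplification rather than the exact procedure above:

```python
# Adaboost-style loop: fit weighted stumps, down-weight correctly classified
# tuples by err/(1-err), and combine votes weighted by log((1-err)/err).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
d = len(y)
w = np.full(d, 1.0 / d)                       # initially all tuple weights are 1/d
models, alphas = [], []

for i in range(10):                            # k = 10 rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)  # error(Mi): sum of weights of misclassified tuples
    if err == 0 or err >= 0.5:                 # stop on useless or too-weak rounds
        break
    alpha = np.log((1 - err) / err)            # weight of classifier Mi's vote
    w[pred == y] *= err / (1 - err)            # correctly classified tuples lose weight
    w /= w.sum()                               # normalize so the weights stay a distribution
    models.append(stump)
    alphas.append(alpha)

# Weighted vote of the composite model M*
votes = np.zeros((d, 2))
for alpha, m in zip(alphas, models):
    votes[np.arange(d), m.predict(X)] += alpha
print("Training accuracy of M*:", np.mean(votes.argmax(axis=1) == y))
```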
Random Forest (Breiman 2001)
◼ Random Forest:
◼ Each classifier in the ensemble is a decision tree classifier, generated from a random selection of attributes at each node; during classification, each tree votes and the most popular class is returned
◼ Two Methods to construct Random Forest:
◼ Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
◼ Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear combination of the existing attributes, which reduces the correlation between the individual classifiers
Summary (II)
◼ Significance tests and ROC curves are useful for model selection.
◼ There have been numerous comparisons of the different
classification methods; the matter remains a research topic
◼ No single method has been found to be superior over all others
for all data sets
◼ Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method
References (1)
◼ C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
◼ C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
◼ L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
◼ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
◼ P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
◼ H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for
Effective Classification, ICDE'07
◼ H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for
Effective Classification, ICDE'08
◼ W. Cohen. Fast effective rule induction. ICML'95
◼ G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
References (2)
◼ A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
◼ G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
◼ R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
◼ U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
◼ Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
◼ J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
◼ J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
◼ T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
◼ D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
◼ W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.
References (3)
◼ T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
◼ J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of
Marketing Research, Blackwell Business, 1994.
◼ M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data
mining. EDBT'96.
◼ T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
◼ S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
◼ J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
◼ J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
◼ J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
◼ J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
References (4)
◼ R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
◼ J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
◼ J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
◼ P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
◼ S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufmann, 1991.
◼ S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
◼ I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
◼ X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
◼ H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
Predictor Error Measures
◼ Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
◼ Loss function: measures the error betw. yi and the predicted value yi’
◼ Absolute error: | yi – yi’|
◼ Squared error: (yi – yi’)2
◼ Test error (generalization error): the average loss over the test set
◼ Mean absolute error: \frac{1}{d}\sum_{i=1}^{d}|y_i - y_i'|    Mean squared error: \frac{1}{d}\sum_{i=1}^{d}(y_i - y_i')^2
◼ Relative absolute error: \frac{\sum_{i=1}^{d}|y_i - y_i'|}{\sum_{i=1}^{d}|y_i - \bar{y}|}    Relative squared error: \frac{\sum_{i=1}^{d}(y_i - y_i')^2}{\sum_{i=1}^{d}(y_i - \bar{y})^2}
◼ The mean squared error exaggerates the presence of outliers
◼ Popularly used: the (square) root mean squared error and, similarly, the root relative squared error (see the sketch below)
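A short NumPy sketch of these measures for a hypothetical vector of actual values y and predictions y_pred:

```python
# Absolute, squared, relative, and root variants of the predictor error measures.
import numpy as np

y      = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.5, 5.0, 3.0, 8.0, 4.0])

mae  = np.mean(np.abs(y - y_pred))                                  # mean absolute error
mse  = np.mean((y - y_pred) ** 2)                                   # mean squared error
rmse = np.sqrt(mse)                                                 # root mean squared error

rae  = np.sum(np.abs(y - y_pred)) / np.sum(np.abs(y - y.mean()))    # relative absolute error
rse  = np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)      # relative squared error
rrse = np.sqrt(rse)                                                 # root relative squared error

print(mae, mse, rmse, rae, rse, rrse)
```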
Scalable Decision Tree Induction Methods
◼ PUBLIC (VLDB’98 — Rastogi & Shim): Integrates tree splitting and tree pruning: stops growing the tree earlier
◼ RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
◼ Builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction
◼ Integration of generalization with decision-tree induction
(Kamber et al.’97)
◼ Classification at primitive concept levels
◼ E.g., precise temperature, humidity, outlook, etc.
◼ Low-level concepts, scattered classes, bushy classification-
trees
◼ Semantic interpretation problems
◼ Cube-based multi-level classification
◼ Relevance analysis at multi-levels
◼ Information-gain analysis with dimension + level