dm4
Prediction Problems: Classification vs. Numeric Prediction
• Classification
  • predicts categorical class labels (discrete or nominal)
  • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
• Typical applications
  • Credit/loan approval
  • Medical diagnosis: if a tumor is cancerous or benign
  • Fraud detection: if a transaction is fraudulent
  • Web page categorization: which category it is
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class
label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae
Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction
[Figure: Training Data → Classification Algorithms → Classifier; the classifier is then applied to Testing Data and Unseen Data]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). Tenured?
Decision Tree Induction: An Example
❑ Training data set: Buys_computer
❑ The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

❑ Resulting tree: [Figure: decision tree rooted at age?, with branches <=30, 31…40, and >40]
Algorithm for Decision Tree Induction
Attribute Selection Measure: Information Gain
(ID3/C4.5)
◼ Select the attribute with the highest information gain
◼ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated
by |Ci, D|/|D|
◼ Expected information (entropy) needed to classify a tuple in D:
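$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
◼ Information needed (after using attribute A to split D into v partitions) to classify D:
$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
◼ Information gained by branching on attribute A:
$Gain(A) = Info(D) - Info_A(D)$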
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
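The gains quoted for the buys_computer example can be reproduced with a few lines of Python. This is a minimal sketch, not the slides' own code: the names `ROWS`, `ATTRS`, `info`, and `gain` are illustrative, and the 14 tuples are copied from the training table in the decision tree example.

```python
from collections import Counter
from math import log2

# The 14 training tuples: (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]


def info(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain(rows, attr_index):
    """Information gain obtained by splitting rows on one attribute."""
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[-1])
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info([r[-1] for r in rows]) - info_a


gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)  # attribute chosen for the root split
for name, g in gains.items():
    print(f"Gain({name}) = {g:.3f}")
```

Under these assumptions the attribute with the highest gain is age, which is consistent with age being the root test of the resulting tree. The printed values may differ from the slide figures in the third decimal place because of rounding.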
Computing Information-Gain for Continuous-Valued
Attributes
• If a data set D contains examples from n classes, the gini index, gini(D), is defined as
$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$
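As a rough illustration of the two gini definitions, the sketch below computes gini(D) and the weighted gini of a binary split. The function names and the class counts (9 "yes" and 5 "no", split into a pure subset of 4 and a mixed subset of 10) are hypothetical choices for the example.

```python
from collections import Counter


def gini(labels):
    """gini(D) = 1 - sum of squared relative class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(d1, d2):
    """Weighted gini index of a binary split of D into D1 and D2."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


# Hypothetical data: 9 "yes" and 5 "no" tuples
d = ["yes"] * 9 + ["no"] * 5
d1 = ["yes"] * 4                 # pure subset
d2 = ["yes"] * 5 + ["no"] * 5    # evenly mixed subset

before = gini(d)
after = gini_split(d1, d2)
print(f"gini(D) = {before:.3f}, gini_A(D) = {after:.3f}")
```

A split is attractive when it lowers the impurity, i.e. when gini_A(D) is smaller than gini(D).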
• CHAID: a popular decision tree algorithm, measure based on χ2 test for independence
• C-SEP: performs better than info. gain and gini index in certain cases
• G-statistic: has a close approximation to χ2 distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
• The best tree is the one that requires the fewest # of bits to both (1) encode the tree,
and (2) encode the exceptions to the tree
• Multivariate splits (partition based on multiple variable combinations)
• CART: finds multivariate splits based on a linear comb. of attrs.
• Which attribute selection measure is the best?
• Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
• Too many branches, some may reflect anomalies due to noise or outliers
• Poor accuracy for unseen samples
• Prepruning: Halt tree construction early, i.e., do not split a node if this would result in
the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
• Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
• Use a set of data different from the training data to decide which is the “best
pruned tree”
Enhancements to Basic Decision Tree Induction
• Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
• Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication
Classification in Large Databases
Classification—a classical problem extensively studied by statisticians and machine
learning researchers
Scalability: Classifying data sets with millions of examples and hundreds of attributes
with reasonable speed
Separates the scalability aspects from the criteria that determine the quality of
the tree
• It turns out that T’ is very close to the tree that would be generated using the whole data set
together
Bayesian Classifier
Bayesian Classification: Why?
• Bayes’ Theorem:
$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (i.e., posteriori probability): the probability that the
hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
Prediction Based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H, P(H|X), follows the
Bayes’ theorem
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)$
• This greatly reduces the computation cost: Only counts the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by
|Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with a mean μ and standard deviation σ:
$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
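The categorical case of the naïve Bayes computation can be sketched as follows. This is an illustrative sketch only: `ROWS` repeats the buys_computer training table, `naive_bayes_scores` is a hypothetical helper, the query tuple is chosen for the example, and no smoothing is applied to zero counts.

```python
from collections import Counter

# Training tuples: (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def naive_bayes_scores(rows, x):
    """Return P(X|Ci) * P(Ci) for every class Ci, using simple counts."""
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(rows)  # prior P(Ci)
        for k, value in enumerate(x):
            # P(xk|Ci): share of class-Ci tuples whose k-th attribute equals value
            matches = sum(1 for r in rows if r[-1] == c and r[k] == value)
            score *= matches / n_c
        scores[c] = score
    return scores


# Hypothetical query: a young student with medium income and fair credit
x = ("<=30", "medium", "yes", "fair")
scores = naive_bayes_scores(ROWS, x)
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

P(X) is the same for every class, so comparing P(X|Ci) × P(Ci) is enough to pick the class with the highest posterior.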
• IF age = young AND student = yes THEN buys_computer = yes
• IF age = mid-age THEN buys_computer = yes
• IF age = old AND credit_rating = excellent THEN buys_computer = no
• IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Induction: Sequential Covering Method
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as positive
are actually positive
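A small sketch of how precision, recall, and the F-measure follow from the confusion matrix counts. The standard definitions are used (precision = TP/(TP+FP), recall = TP/(TP+FN), F_β as the weighted harmonic mean); the counts below are hypothetical and only serve the example.

```python
def precision(tp, fp):
    """Share of tuples labeled positive that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the classifier labeled positive."""
    return tp / (tp + fn)


def f_measure(p, r, beta=1.0):
    """F_beta score; beta = 1 gives the harmonic mean of p and r."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Hypothetical confusion matrix counts for an imbalanced problem
tp, fn, fp, tn = 90, 210, 140, 9560

p = precision(tp, fp)
r = recall(tp, fn)
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(f"accuracy={accuracy:.4f} precision={p:.4f} recall={r:.4f} F1={f_measure(p, r):.4f}")
```

With these counts the accuracy is high while precision and recall are low, which is why accuracy alone can be misleading when the positive class is rare.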
Classifier Evaluation Metrics: Example
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
• At i-th iteration, use Di as test set and others as training set
• Leave-one-out: k folds where k = # of tuples, for small sized data
• *Stratified cross-validation*: folds are stratified so that class dist. in each
fold is approx. the same as that in the initial data
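The k-fold procedure can be sketched in plain Python as below. The helper names, the majority-class baseline used as a stand-in classifier, and the toy data are all assumptions made for the example; the folds are not stratified.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(data, k, train, accuracy):
    """Average accuracy over k iterations; fold i is the test set at iteration i."""
    scores = []
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        test = [data[j] for j in test_idx]
        training = [data[j] for j in range(len(data)) if j not in held_out]
        scores.append(accuracy(train(training), test))
    return sum(scores) / k


# Stand-in classifier (hypothetical): always predict the majority training class
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]


def accuracy_of(model, test):
    return sum(1 for _, label in test if label == model) / len(test)


data = [(i, "yes") for i in range(15)] + [(i, "no") for i in range(15, 25)]
folds = k_fold_indices(len(data), 10)
estimate = cross_validate(data, 10, train_majority, accuracy_of)
print(f"10-fold accuracy estimate: {estimate:.3f}")
```

Leave-one-out corresponds to calling the same function with k equal to the number of tuples.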
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and re-
added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
• A data set with d tuples is sampled d times, with replacement, resulting in a training set
of d samples. The data tuples that did not make it into the training set end up forming the
test set. About 63.2% of the original data end up in the bootstrap, and the remaining
36.8% form the test set (since $(1 - 1/d)^d \approx e^{-1} = 0.368$)
• Repeat the sampling procedure k times, overall accuracy of the model:
$Acc(M) = \frac{1}{k}\sum_{i=1}^{k}\left(0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set}\right)$
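The 63.2% / 36.8% figures can be checked empirically. The sketch below draws one bootstrap sample and compares the share of distinct tuples with the limit of (1 − 1/d)^d; the function name, seed, and data set size are arbitrary choices for the illustration.

```python
import math
import random


def bootstrap_split(data, rng):
    """Sample len(data) tuples with replacement; unsampled tuples form the test set."""
    d = len(data)
    picked = [rng.randrange(d) for _ in range(d)]
    chosen = set(picked)
    train = [data[i] for i in picked]
    test = [data[i] for i in range(d) if i not in chosen]
    return train, test


d = 10_000
train, test = bootstrap_split(list(range(d)), random.Random(0))

unique_fraction = len(set(train)) / d   # share of original tuples in the bootstrap
left_out_fraction = len(test) / d       # share that ends up in the test set
limit = (1 - 1 / d) ** d                # probability a given tuple is never picked

print(f"in bootstrap: {unique_fraction:.3f}, left out: {left_out_fraction:.3f}")
print(f"(1 - 1/d)^d = {limit:.4f}, e^-1 = {math.exp(-1):.4f}")
```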
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
• Suppose we have 2 classifiers, M1 and M2, which one is better?
• These mean error rates are just estimates of error on the true population of future
data cases
• What if the difference between the 2 error rates is just attributed to chance?
Estimating Confidence Intervals:
Null Hypothesis
• Assume samples follow a t distribution with k–1 degrees of freedom (here, k=10)
Estimating Confidence Intervals: t-test
where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
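A hedged sketch of a pairwise t-test on per-fold error rates follows. It assumes both models were evaluated on the same k cross-validation folds and uses the standard paired t statistic with the sample standard deviation (k − 1 in the denominator), which may differ in detail from the formula on the original slide. The error rates are made up for the example.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(err1, err2):
    """Paired t statistic and degrees of freedom for per-fold error rates."""
    diffs = [a - b for a, b in zip(err1, err2)]
    k = len(diffs)
    # stdev() is the sample standard deviation (divides by k - 1)
    t = mean(diffs) / (stdev(diffs) / sqrt(k))
    return t, k - 1


# Hypothetical per-fold error rates of M1 and M2 on the same 4 folds
err_m1 = [0.10, 0.12, 0.11, 0.13]
err_m2 = [0.08, 0.09, 0.10, 0.09]

t, dof = paired_t(err_m1, err_m2)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

The resulting t value is then compared with the critical value from a t-distribution table at the chosen significance level; if it exceeds that value, the difference between the two models is taken to be statistically significant.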
Estimating Confidence Intervals:
Table for t-distribution
• Symmetric
• Significance level, e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of population
• Confidence limit, z = sig/2
Estimating Confidence Intervals:
Statistical Significance
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
◼ Vertical axis represents the true positive rate
◼ Horizontal axis represents the false positive rate
◼ The plot also shows a diagonal line
◼ A model with perfect accuracy will have an area of 1.0
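The ranking-based construction of the ROC curve can be sketched as follows. The function names, scores, and labels are hypothetical, ties between scores are not handled specially, and the area is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points after ranking tuples by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1   # a true positive moves the curve up
        else:
            fp += 1   # a false positive moves the curve right
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical test tuples: 1 = positive class, with classifier scores
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

auc = area_under_curve(roc_points(labels, scores))
print(f"AUC = {auc:.3f}")
```

A ranking that places every positive tuple above every negative one yields an area of 1.0, while a random ranking stays near the diagonal with an area close to 0.5.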
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules
Chapter 8. Classification: Basic Concepts
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of
classifiers
• Boosting: weighted vote with a collection of classifiers
• Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d
tuples is sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with
the most votes to X
• Prediction: can be applied to the prediction of continuous values by
taking the average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, more robust
• Proved improved accuracy in prediction
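The training and voting steps of bagging can be sketched as below. The base learner here is a hypothetical one-dimensional threshold classifier (`train_stump`) chosen only to keep the example short; any classifier could take its place, and the toy data are made up.

```python
import random
from collections import Counter


def train_stump(sample):
    """Hypothetical base learner: best single threshold on a 1-D attribute."""
    best = None
    for thr, _ in sample:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= thr else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right


def bagging(data, k, rng):
    """Learn k models on bootstrap samples and combine them by majority vote."""
    models = []
    for _ in range(k):
        boot = [rng.choice(data) for _ in data]   # d tuples, with replacement
        models.append(train_stump(boot))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical 1-D data: class 0 for x in 1..5, class 1 for x in 6..10
data = [(x, 0) for x in range(1, 6)] + [(x, 1) for x in range(6, 11)]
bagged = bagging(data, k=25, rng=random.Random(0))
print(bagged(1), bagged(10))
```

For numeric prediction the majority vote in `predict` would be replaced by the average of the individual predictions.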
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to allow
the subsequent classifier, Mi+1, to pay more attention to the
training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training
set Di of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased, o.w. it is
decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi
error rate is the sum of the weights of the misclassified tuples:
$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$
• The weight of classifier Mi’s vote is
$\log \frac{1 - error(M_i)}{error(M_i)}$
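One round of the AdaBoost bookkeeping can be sketched as follows. The error and vote-weight formulas follow the definitions above; the weight update (scale correctly classified tuples by error/(1 − error), then normalize) is one common formulation and is an assumption here, since the slide only states that weights of misclassified tuples increase and the others decrease. The weights and misclassification flags are hypothetical.

```python
from math import log


def weighted_error(weights, misclassified):
    """error(Mi): sum of the weights of the misclassified tuples."""
    return sum(w for w, wrong in zip(weights, misclassified) if wrong)


def vote_weight(error):
    """Weight of the classifier's vote: log((1 - error) / error)."""
    return log((1 - error) / error)


def update_weights(weights, misclassified, error):
    """Assumed update: shrink weights of correctly classified tuples, then normalize."""
    scaled = [w if wrong else w * error / (1 - error)
              for w, wrong in zip(weights, misclassified)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Hypothetical round: 4 tuples with equal weights, the first one misclassified
weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [True, False, False, False]

err = weighted_error(weights, misclassified)
alpha = vote_weight(err)
new_weights = update_weights(weights, misclassified, err)
print(err, round(alpha, 3), [round(w, 3) for w in new_weights])
```

After normalization the misclassified tuple carries a larger share of the total weight, so it is more likely to be sampled in the next round.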
Random Forest
• Random Forest (Breiman 2001):
• Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
• During classification, each tree votes and the most popular class is
returned
• Two Methods to construct Random Forest:
• Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART
methodology is used to grow the trees to maximum size
• Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes
(reduces the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and
outliers
• Insensitive to the number of attributes selected for consideration at each
split, and faster than bagging or boosting
Classification of Class-Imbalanced Data Sets
Predictor Error Measures
• Measure predictor accuracy: measure how far off the predicted value is
from the actual known value
• Loss function: measures the error betw. yi and the predicted value yi’
• Absolute error: | yi – yi’|
• Squared error: (yi – yi’)²
• Test error (generalization error): the average loss over the test set
• Mean absolute error: $\frac{1}{d}\sum_{i=1}^{d} |y_i - y_i'|$
• Mean squared error: $\frac{1}{d}\sum_{i=1}^{d} (y_i - y_i')^2$
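The two test-error measures translate directly into code. The function names and the actual/predicted values below are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average of |yi - yi'| over the test set."""
    return sum(abs(y - yp) for y, yp in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (yi - yi')^2 over the test set."""
    return sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / len(actual)


# Hypothetical test set: known values and the predictor's outputs
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(f"MAE = {mae}, MSE = {mse}")
```

The squared error penalizes the single large miss (2.0 vs. 4.0) much more heavily than the absolute error does.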
Lazy vs. Eager Learning
• Lazy vs. eager learning
• Lazy learning (e.g., instance-based learning): Simply
stores training data (or only minor processing) and waits
until it is given a test tuple
• Eager learning (the above discussed methods): Given a
set of training tuples, constructs a classification model
before receiving new (e.g., test) data to classify
• Lazy: less time in training but more time in predicting
• Accuracy
• Lazy method effectively uses a richer hypothesis space
since it uses many local linear functions to form an
implicit global approximation to the target function
• Eager: must commit to a single hypothesis that covers the
entire instance space
Lazy Learner: Instance-Based Methods
• Instance-based learning:
• Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean space.
• Locally weighted regression
• Constructs local approximation
• Case-based reasoning
• Uses symbolic representations and knowledge-based
inference
The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-D space
• The nearest neighbors are defined in terms of Euclidean
distance, dist(X1, X2)
• Target function could be discrete- or real- valued
• For discrete-valued, k-NN returns the most common
value among the k training examples nearest to xq
• Voronoi diagram: the decision surface induced by 1-NN
for a typical set of training examples
[Figure: positive (+) and negative (−) training examples surrounding the query point xq]
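For a discrete-valued target, k-NN can be sketched in a few lines. The function name and the two-dimensional training points are hypothetical; `math.dist` (Python 3.8+) provides the Euclidean distance.

```python
from collections import Counter
from math import dist


def knn_classify(training, xq, k):
    """Return the most common label among the k training examples nearest to xq."""
    nearest = sorted(training, key=lambda example: dist(example[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training examples: (point, label)
training = [
    ((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
    ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-"),
]

print(knn_classify(training, (2, 2), k=3))
print(knn_classify(training, (5, 4), k=3))
```

No model is built ahead of time: all the work happens when a query point arrives, which is what makes the method lazy.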
Discussion on the k-NN Algorithm
• k-NN for real-valued prediction for a given unknown tuple
• Returns the mean values of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
• Weight the contribution of each of the k neighbors
according to their distance to the query xq
• Give greater weight to closer neighbors: $w \equiv \frac{1}{d(x_q, x_i)^2}$
• Robust to noisy data by averaging k-nearest neighbors
• Curse of dimensionality: distance between neighbors could
be dominated by irrelevant attributes
• To overcome it, axes stretch or elimination of the least
relevant attributes
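Distance-weighted k-NN for real-valued prediction can be sketched as a weighted mean with weights 1/d(xq, xi)². The function name and data are hypothetical, and returning the stored value when the query coincides with a training point is an added convention to avoid division by zero.

```python
from math import dist


def weighted_knn_predict(training, xq, k):
    """Weighted mean of the k nearest target values, with weights 1 / d(xq, xi)^2."""
    nearest = sorted(training, key=lambda example: dist(example[0], xq))[:k]
    numerator = denominator = 0.0
    for x, y in nearest:
        d = dist(x, xq)
        if d == 0:
            return y          # exact match: use the stored value directly
        w = 1 / d ** 2        # closer neighbors receive greater weight
        numerator += w * y
        denominator += w
    return numerator / denominator


# Hypothetical 1-D training examples: (point, real-valued target)
training = [((0.0,), 1.0), ((2.0,), 3.0), ((6.0,), 10.0)]

print(weighted_knn_predict(training, (1.0,), k=2))
print(weighted_knn_predict(training, (0.5,), k=2))
```

When the two neighbors are equally far away the result is their plain mean; as the query moves toward one of them, the prediction is pulled toward that neighbor's value.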
Linear Regression
• Linear regression: involves a response variable y and a single predictor
variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
• Method of least squares: estimates the best-fitting straight line
$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}$
$w_0 = \bar{y} - w_1 \bar{x}$
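The least-squares estimates of w0 and w1 can be computed directly from their definitions. The function name and the sample points are hypothetical; the points were chosen to lie exactly on y = 1 + 2x so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical data lying on the line y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w0, w1 = fit_line(xs, ys)
print(f"y = {w0:.2f} + {w1:.2f} x")
```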
Other Regression-Based Models
• Generalized linear model:
• Foundation on which linear regression can be applied to modeling
categorical response variables
• Variance of y is a function of the mean value of y, not a constant
• Logistic regression: models the prob. of some event occurring as a
linear function of a set of predictor variables
• Poisson regression: models the data that exhibit a Poisson distribution
• Log-linear models: (for categorical data)
• Approximate discrete multidimensional prob. distributions
• Also useful for data compression and smoothing
• Regression trees and model trees
• Trees to predict continuous values rather than class labels
Regression Trees and Model Trees
CART: Classification And Regression Trees
Regression tree: proposed Each leaf stores a continuous-valued