Data Mining Book
Classification and Prediction
1
Classification vs. Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
• Prediction
– models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection
2
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of a test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• The test set is independent of the training set; otherwise over-fitting
will occur
– If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
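A minimal sketch of the two-step process in Python (scikit-learn and the iris data are illustrative assumptions; the slides do not prescribe a library or dataset):

    # Step 1: model construction on the training set; Step 2: model usage on the test set
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    # keep the test set independent of the training set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
    model = DecisionTreeClassifier().fit(X_train, y_train)   # model construction
    y_pred = model.predict(X_test)                           # model usage
    print(accuracy_score(y_test, y_pred))   # % of test samples correctly classified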
3
Process (1): Model Construction
[Figure: the training data below are fed to a classification algorithm, which constructs a classifier (model)]

Training Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Process (2): Using the Model in Prediction
[Figure: the classifier is applied to the testing data and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?]
5
Supervised vs. Unsupervised Learning
6
Issues: Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
7
Issues: Evaluating Classification Methods
• Accuracy
– classifier accuracy: predicting class label
– predictor accuracy: how well the value of the predicted attribute is estimated
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
8
Decision Tree Induction: Training Dataset
[Figure: decision tree — root node tests "age?"; branches <=30, 31..40, >40 lead to leaf predictions no / yes / yes]
10
Algorithm for Decision Tree Induction
11
Attribute Selection Measure: Information
Gain (ID3/C4.5)
Expected information (entropy) needed to classify a tuple in D:
Info(D) = − Σi=1..m pi log2(pi)
Information needed (after using A to split D into v partitions) to classify D:
InfoA(D) = Σj=1..v (|Dj| / |D|) × Info(Dj)
Information gained by branching on attribute A:
Gain(A) = Info(D) − InfoA(D)
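A small pure-Python sketch of these formulas (the helper names are illustrative, not from the slides):

    import math
    from collections import Counter

    def info(labels):
        # Info(D) = -sum_i p_i * log2(p_i)
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(rows, labels, attr):
        # Gain(A) = Info(D) - Info_A(D), where A splits D into v partitions
        n = len(labels)
        parts = {}
        for row, lab in zip(rows, labels):
            parts.setdefault(row[attr], []).append(lab)
        info_a = sum(len(p) / n * info(p) for p in parts.values())
        return info(labels) - info_a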
12
Attribute Selection Measure: Gini Index
• If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 − Σj=1..n pj²
where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index giniA(D) is
defined as
giniA(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
• Reduction in impurity:
Δgini(A) = gini(D) − giniA(D)
• In the income example, gini{medium,high} is 0.30 and thus the best split, since it is the lowest
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes
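A matching pure-Python sketch of the gini formulas above (illustrative helper names):

    from collections import Counter

    def gini(labels):
        # gini(D) = 1 - sum_j p_j^2
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(d1, d2):
        # gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2) for a binary split
        n = len(d1) + len(d2)
        return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)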
17
Comparing Attribute Selection Measures
19
Overfitting and Tree Pruning
20
Enhancements to Basic Decision Tree Induction
22
Scalable Decision Tree Induction Methods
23
Scalability Framework for RainForest
24
Rainforest: Training Set and Its AVC Sets
27
Presentation of Classification Results
28
Visualization of a Decision Tree in SGI/MineSet 3.0
29
Interactive Visual Mining by Perception-Based
Classification (PBC)
30
Bayesian Classification: Why?
31
Bayesian Theorem: Basics
34
Derivation of Naïve Bayes Classifier
If attribute Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) · e^(−(x − μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi)
35
Naïve Bayesian Classifier: Training Dataset
Class: C1: buys_computer = 'yes', C2: buys_computer = 'no'
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
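A counting sketch of naïve Bayes on this table (the tuples below are transcribed from the table; the code itself is an illustration, not part of the slides):

    from collections import Counter

    # (age, income, student, credit_rating, buys_computer)
    data = [
        ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
        ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
        ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
        ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
    ]
    X = ("<=30", "medium", "yes", "fair")       # the data sample

    class_counts = Counter(row[-1] for row in data)
    for c, nc in class_counts.items():
        p = nc / len(data)                      # prior P(Ci)
        for k, xk in enumerate(X):              # naive product of P(xk|Ci)
            p *= sum(1 for r in data if r[k] == xk and r[-1] == c) / nc
        print(c, p)                             # pick the class with the largest P(X|Ci)P(Ci)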
36
Naïve Bayesian Classifier: An Example
37
Avoiding the 0-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero;
otherwise the predicted probability of the whole tuple is zero
• Ex. Suppose a dataset with 1000 tuples: income = low (0 tuples), income = medium
(990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their “uncorrected”
counterparts
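The correction in code, reproducing the slide's numbers (a minimal sketch):

    # add 1 to each value's count and q (= number of values) to the denominator
    counts = {"low": 0, "medium": 990, "high": 10}
    n, q = sum(counts.values()), len(counts)    # 1000 tuples, 3 income values
    for value, c in counts.items():
        print(value, (c + 1) / (n + q))         # 1/1003, 991/1003, 11/1003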
38
Naïve Bayesian Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of
accuracy
– Practically, dependencies exist among variables
• E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by Naïve Bayesian
Classifier
• How to deal with these dependencies?
– Bayesian Belief Networks
39
Bayesian Belief Networks
• Several scenarios:
– Given both the network structure and all variables
observable: learn only the CPTs
– Network structure known, some hidden variables: gradient
descent (greedy hill-climbing) method, analogous to neural
network learning
– Network structure unknown, all variables observable:
search through the model space to reconstruct network
topology
– Unknown structure, all hidden variables: No good
algorithms known for this purpose
• Ref. D. Heckerman: Bayesian networks for data mining
42
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage and accuracy (see the code sketch below)
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
• If more than one rule is triggered, need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that has the
"toughest" requirement (i.e., the most attribute tests)
– Class-based ordering: decreasing order of prevalence or misclassification cost per
class
– Rule-based ordering (decision list): rules are organized into one long priority list,
according to some measure of rule quality or by experts
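A sketch of coverage and accuracy in Python (the tuple representation and rule encoding are assumptions for illustration):

    def rule_quality(rule, dataset, label):
        # coverage(R) = ncovers / |D|;  accuracy(R) = ncorrect / ncovers
        covered = [t for t in dataset if rule(t)]
        if not covered:
            return 0.0, 0.0
        correct = sum(1 for t in covered if t["class"] == label)
        return len(covered) / len(dataset), correct / len(covered)

    # R: IF age = youth AND student = yes THEN buys_computer = yes
    R = lambda t: t["age"] == "youth" and t["student"] == "yes"
    # coverage, accuracy = rule_quality(R, D, "yes")   # D: training data set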
43
Rule Extraction from a Decision Tree
[Figure: buys_computer decision tree — root "age?"; branch <=30 → "student?" (leaves no / yes), branch 31..40 → leaf yes, branch >40 → "credit rating?" (branches excellent / fair to leaves)]
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
• Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = yes
IF age = old AND credit_rating = fair THEN buys_computer = no
44
Rule Extraction from the Training Data
45
How to Learn-One-Rule?
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R
If FOIL_Prune is higher for the pruned version of R, prune R
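The pruning test as a one-line function (a sketch; variable names are illustrative):

    def foil_prune(pos, neg):
        # FOIL_Prune(R) = (pos - neg) / (pos + neg), counted over tuples covered by R
        return (pos - neg) / (pos + neg)

    # prune R whenever the pruned rule scores higher:
    # if foil_prune(pos_pruned, neg_pruned) > foil_prune(pos, neg): keep the pruned R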
46
Classification: A Mathematical Mapping
• Classification:
– predicts categorical class labels
• E.g., Personal homepage classification
– xi = (x1, x2, x3, …), yi = +1 or –1
– x1: # of occurrences of the word "homepage"
– x2: # of occurrences of the word "welcome"
• Mathematically
– x ∈ X = ℝⁿ, y ∈ Y = {+1, –1}
– We want a function f: X → Y
47
Linear Classification
48
Discriminative Classifiers
• Advantages
– prediction accuracy is generally high
• As compared to Bayesian methods – in general
– robust, works when training examples contain errors
– fast evaluation of the learned target function
• Bayesian networks are normally slow
• Criticism
– long training time
– difficult to understand the learned function (weights)
• Bayesian networks can be used easily for pattern discovery
– not easy to incorporate domain knowledge
• Easy in the form of priors on the data or distributions
49
Perceptron & Winnow
• Vectors: x, w; scalars: x, y, w
• Input: {(x1, y1), …}
• Output: classification function f(x) with
f(xi) > 0 for yi = +1
f(xi) < 0 for yi = –1
• Decision boundary: w·x + b = 0, or w1x1 + w2x2 + b = 0
• Perceptron: update w additively
• Winnow: update w multiplicatively
[Figure: two classes in the (x1, x2) plane separated by the line w1x1 + w2x2 + b = 0]
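A sketch contrasting the two update rules (the learning rate, promotion factor, and threshold values are illustrative assumptions):

    import numpy as np

    def perceptron_step(w, b, x, y, eta=0.1):
        # additive update on a mistake: w <- w + eta * y * x
        if y * (np.dot(w, x) + b) <= 0:
            w, b = w + eta * y * x, b + eta * y
        return w, b

    def winnow_step(w, x, y, alpha=2.0, theta=1.0):
        # multiplicative update on a mistake (0/1 features): w_i <- w_i * alpha^(y * x_i)
        if y * (np.dot(w, x) - theta) <= 0:
            w = w * alpha ** (y * x)
        return w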
50
SVM—Support Vector Machines
51
SVM—History and Applications
52
SVM—General Philosophy
53
SVM—Margins and Support
Vectors
54
SVM—When Data Is Linearly Separable
Let the data D be (X1, y1), …, (X|D|, y|D|), where the Xi are the training tuples and
the yi their associated class labels
There are an infinite number of lines (hyperplanes) separating the two classes, but we
want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal
hyperplane (MMH)
55
SVM—Linearly Separable
A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization problem: a quadratic
objective function with linear constraints, solvable by quadratic
programming (QP) using Lagrangian multipliers
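A minimal sketch with scikit-learn (an assumed library; the toy points are illustrative) showing W, b, and the support vectors of the learned MMH:

    from sklearn import svm

    X = [[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]]
    y = [-1, -1, -1, 1, 1, 1]
    clf = svm.SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
    print(clf.coef_, clf.intercept_)                  # W and b of W.X + b = 0
    print(clf.support_vectors_)                       # tuples falling on H1 / H2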
56
Why Is SVM Effective on High Dimensional Data?
57
SVM—Linearly Inseparable
• Transform the original input data into a higher dimensional space
• Search for a linear separating hyperplane in the new space
[Figure: classes plotted over attributes A1 and A2; not linearly separable in the original space]
58
SVM—Kernel functions
Instead of computing the dot product on the transformed data tuples, it is
mathematically equivalent to apply a kernel function K(Xi, Xj) to
the original data, i.e., K(Xi, Xj) = Φ(Xi)·Φ(Xj)
Typical kernel functions include the polynomial kernel, the Gaussian radial basis
function (RBF) kernel, and the sigmoid kernel
SVM can also be used for classifying multiple (> 2) classes and for regression
analysis (with additional user parameters)
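A numeric check of K(Xi, Xj) = Φ(Xi)·Φ(Xj) for a degree-2 polynomial kernel (one standard feature map for 2-D inputs, chosen for illustration):

    import numpy as np

    def phi(x):
        # explicit degree-2 feature map: (x1^2, x2^2, sqrt(2) x1 x2)
        x1, x2 = x
        return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

    def k_poly(a, b):
        # polynomial kernel on the original data: K(Xi, Xj) = (Xi . Xj)^2
        return np.dot(a, b) ** 2

    a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    print(np.dot(phi(a), phi(b)), k_poly(a, b))   # both print 16.0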
59
Scaling SVM by Hierarchical Micro-Clustering
• SVM is not scalable to the number of data objects in terms of training time
and memory usage
• "Classifying Large Datasets Using SVMs with Hierarchical Clusters"
by Hwanjo Yu, Jiong Yang, Jiawei Han, KDD'03
• CB-SVM (Clustering-Based SVM)
– Problem: given a limited amount of system resources (e.g., memory), maximize the
SVM performance in terms of accuracy and training speed
– Use micro-clustering to effectively reduce the number of points to be
considered
– When deriving support vectors, de-cluster micro-clusters near the "candidate
vectors" to ensure high classification accuracy
60
CB-SVM: Clustering-Based SVM
62
CB-SVM Algorithm: Outline
63
Selective Declustering
64
Experiment on Synthetic Dataset
65
Experiment on a Large Data Set
66
SVM vs. Neural Network
Linear Regression
• y = w0 + w1 x: the coefficients w0 (intercept) and w1 (slope) are estimated
by the method of least squares over the training data D:
w1 = Σi=1..|D| (xi − x̄)(yi − ȳ) / Σi=1..|D| (xi − x̄)²
w0 = ȳ − w1 x̄
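The least squares estimates in code (the toy numbers are illustrative):

    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 3.9, 6.2, 7.8]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    print(w0, w1)   # fitted y = w0 + w1 * x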
71
Regression Trees and Model Trees
72
Predictive Modeling in Multidimensional Databases
73
Prediction: Numerical Data
74
Prediction: Categorical Data
75
Predictor Error Measures
• Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
• Loss function: measures the error between yi and the predicted value yi′
– Absolute error: | yi – yi’|
– Squared error: (yi – yi’)2
• Test error (generalization error): the average loss over the test set
– Mean absolute error: Σi=1..d |yi − yi′| / d
– Mean squared error: Σi=1..d (yi − yi′)² / d
– Relative absolute error: Σi=1..d |yi − yi′| / Σi=1..d |yi − ȳ|
– Relative squared error: Σi=1..d (yi − yi′)² / Σi=1..d (yi − ȳ)²
• The mean squared error exaggerates the presence of outliers
• Popularly used: the (square) root mean squared error and, similarly, the root
relative squared error
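The four measures in one helper (a sketch; names are illustrative):

    import math

    def predictor_errors(y, y_pred):
        d = len(y)
        y_bar = sum(y) / d
        mae = sum(abs(a - p) for a, p in zip(y, y_pred)) / d
        mse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / d
        rae = sum(abs(a - p) for a, p in zip(y, y_pred)) / sum(abs(a - y_bar) for a in y)
        rse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / sum((a - y_bar) ** 2 for a in y)
        return {"MAE": mae, "RMSE": math.sqrt(mse), "RAE": rae, "RSE": rse}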
77
Evaluating the Accuracy of a Classifier or Predictor (I)
• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each
of approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized data
– Stratified cross-validation: folds are stratified so that class dist. in each
fold is approx. the same as that in the initial data
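A 10-fold stratified cross-validation sketch with scikit-learn (an assumed library; the classifier and data are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    # stratified 10-fold CV: per-fold class distribution ~ that of the full data
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    print(cross_val_score(DecisionTreeClassifier(), X, y, cv=cv).mean())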
78
Evaluating the Accuracy of a Classifier or Predictor (II)
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
– Suppose we are given a data set of d tuples. The data set is sampled d times, with
replacement, resulting in a training set of d samples. The data tuples that did not
make it into the training set end up forming the test set. About 63.2% of the
original data will end up in the bootstrap, and the remaining 36.8% will form the
test set (since (1 – 1/d)^d ≈ e^(–1) = 0.368)
– Repeat the sampling procedure k times; the overall accuracy of the model is
acc(M) = Σi=1..k (0.632 × acc(Mi)test_set + 0.368 × acc(Mi)train_set)
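A .632 bootstrap sketch (NumPy/scikit-learn and the averaging over k are illustrative choices):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    d, k, rng = len(y), 10, np.random.default_rng(0)
    acc = 0.0
    for _ in range(k):
        train = rng.integers(0, d, size=d)          # sample d times with replacement
        test = np.setdiff1d(np.arange(d), train)    # the ~36.8% never drawn
        m = DecisionTreeClassifier().fit(X[train], y[train])
        acc += 0.632 * m.score(X[test], y[test]) + 0.368 * m.score(X[train], y[train])
    print(acc / k)   # averaged over the k repetitions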
79
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of
classifiers
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous classifiers
80
Bagging: Bootstrap Aggregation
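A bagging sketch with scikit-learn (an assumed library; the base learner and data are illustrative): each of the k trees is trained on a bootstrap sample, and prediction is by majority vote:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
    print(cross_val_score(bag, X, y, cv=10).mean())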