Unit 4 DM
Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy: Ensemble Methods
Summary
1
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
3
Prediction Problems: Classification vs. Numeric Prediction
Classification
predicts categorical class labels (discrete or nominal)
Numeric prediction
models continuous-valued functions, i.e., predicts unknown or missing values
4
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the classified
result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called validation (test) set
5
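A minimal sketch of this two-step process in Python (the scikit-learn library, the iris data, and the decision tree learner are illustrative choices, not part of the slides): build the model on a training set, then estimate accuracy on an independent test set.

```python
# Illustrative sketch: two-step classification with a held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set,
# then (if acceptable) classify new, unseen tuples
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```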
Process (1): Model Construction
(Figure: the training data below are fed to a classification algorithm, which learns a classifier; the classifier is then applied to testing data and to unseen data such as (Jeff, Professor, 4) to answer “Tenured?”)
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
7
Chapter 8. Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
8
Decision Tree Induction: An Example
Training data set: Buys_computer
The data set follows an example of Quinlan’s ID3 (Playing Tennis)
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Resulting tree: the root tests age?; the <=30 branch tests student? (no → no, yes → yes), the 31…40 branch predicts yes, and the >40 branch tests credit_rating? (excellent → no, fair → yes)
9
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
11
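The following is a minimal Python sketch of this greedy, top-down, divide-and-conquer procedure for categorical attributes, using information gain as the selection measure; the helper names and the use of only the first four buys_computer tuples are illustrative choices.

```python
# A minimal ID3-style sketch: recursive partitioning on the attribute
# with the highest information gain (categorical attributes only).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    total, n, remainder = entropy(labels), len(labels), 0.0
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:              # all tuples in one class: leaf
        return labels[0]
    if not attrs:                          # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for v in set(r[best] for r in rows):   # partition on the selected attribute
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[best][v] = build_tree([rows[i] for i in idx],
                                   [labels[i] for i in idx],
                                   [a for a in attrs if a != best])
    return tree

rows = [  # first four buys_computer tuples, for illustration only
    {"age": "<=30", "income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "<=30", "income": "high", "student": "no", "credit_rating": "excellent"},
    {"age": "31..40", "income": "high", "student": "no", "credit_rating": "fair"},
    {"age": ">40", "income": "medium", "student": "no", "credit_rating": "fair"},
]
labels = ["no", "no", "yes", "yes"]
print(build_tree(rows, labels, ["age", "income", "student", "credit_rating"]))
```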
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
(For continuous-valued attributes, may need other tools, e.g., clustering, to get the possible split values)
17
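As a worked check of the two formulas, the snippet below computes Info(D) and Gain(age) for the 14-tuple buys_computer data (9 yes, 5 no; splitting on age gives the partitions (2 yes, 3 no), (4 yes, 0 no), (3 yes, 2 no)).

```python
# Worked check of the information-gain formulas on the buys_computer data.
from math import log2

def I(*counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

info_D = I(9, 5)                                              # expected information, ~0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)   # after splitting on age
print(round(info_D, 3), round(info_D - info_age, 3))          # Gain(age) ~ 0.246
```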
Comparing Attribute Selection Measures
The three measures, in general, return good results, but each has its bias:
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is
much smaller than the others
Gini index:
biased to multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions
and purity in both partitions
18
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
C-SEP: performs better than info. gain and gini index in certain cases
G-statistic: has a close approximation to χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
Most give good results; none is significantly superior to the others
19
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: halt tree construction early; do not split a node if the split would fall below a goodness threshold
Postpruning: remove branches from a “fully grown” tree, then select the best pruned tree (e.g., using a separate data set)
23
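A hedged sketch of pruning in practice (scikit-learn and its dataset are my choices, not the slides'): prepruning by limiting depth and leaf size, and postpruning via cost-complexity pruning (ccp_alpha).

```python
# Compare an unpruned tree with prepruned and postpruned variants.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

for name, m in [("full", full), ("prepruned", pre), ("postpruned", post)]:
    print(name, "train", m.score(X_tr, y_tr), "test", m.score(X_te, y_te))
```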
Rainforest: Training Set and Its AVC Sets
25
Presentation of Classification Results
29
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
30
Bayes’ Theorem: Basics
Total probability theorem: P(B) = \sum_{i=1}^{M} P(B \mid A_i) P(A_i)
Bayes’ theorem: P(H \mid X) = \frac{P(X \mid H) P(H)}{P(X)}, where X is a data tuple (“evidence”) and H is a hypothesis, e.g., that X belongs to class C
33
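A tiny numeric illustration of the two theorems (the probabilities below are made up for the example).

```python
# Two classes A1, A2 partition the data; B is an observed event.
P_A = {"A1": 0.6, "A2": 0.4}           # prior probabilities P(A_i)
P_B_given_A = {"A1": 0.2, "A2": 0.5}   # likelihoods P(B | A_i)

# Total probability theorem: P(B) = sum_i P(B | A_i) P(A_i)
P_B = sum(P_B_given_A[a] * P_A[a] for a in P_A)

# Bayes' theorem: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
posterior = {a: P_B_given_A[a] * P_A[a] / P_B for a in P_A}
print(P_B, posterior)   # 0.32, {'A1': 0.375, 'A2': 0.625}
```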
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)
This greatly reduces the computation cost: only counts the class distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
and P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
34
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Training data: the 14-tuple buys_computer data set shown earlier (attributes age, income, student, credit_rating)
Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
35
Naïve Bayes Classifier: An Example
P(C_i): P(buys_computer = “yes”) = 9/14 = 0.643
        P(buys_computer = “no”) = 5/14 = 0.357
Compute P(x_k | C_i) for each attribute value of X:
P(age <= 30 | “yes”) = 2/9 = 0.222          P(age <= 30 | “no”) = 3/5 = 0.600
P(income = medium | “yes”) = 4/9 = 0.444    P(income = medium | “no”) = 2/5 = 0.400
P(student = yes | “yes”) = 6/9 = 0.667      P(student = yes | “no”) = 1/5 = 0.200
P(credit_rating = fair | “yes”) = 6/9 = 0.667   P(credit_rating = fair | “no”) = 2/5 = 0.400
P(X | C_i): P(X | “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
            P(X | “no”) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | C_i) × P(C_i): P(X | “yes”) × P(“yes”) = 0.028
                     P(X | “no”) × P(“no”) = 0.007
Therefore, X belongs to class “buys_computer = yes”
37
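The same computation can be written as a short Python sketch; the helper names are illustrative, and no zero-probability (Laplacian) correction is applied.

```python
# Minimal categorical naive Bayes reproducing the computation above.
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
class_counts = Counter(row[-1] for row in data)

def p_x_given_c(attr_idx, value, c):
    # P(x_k | C_i): fraction of class-c tuples having this attribute value
    return sum(1 for r in data if r[-1] == c and r[attr_idx] == value) / class_counts[c]

X = ("<=30", "medium", "yes", "fair")
scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(data)                       # P(C_i)
    for i, v in enumerate(X):
        score *= p_x_given_c(i, v, c)             # multiply the P(x_k | C_i)
    scores[c] = score
print(scores, "->", max(scores, key=scores.get))  # 'yes' wins (~0.028 vs ~0.007)
```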
Naïve Bayes Classifier: Comments
Advantages
Easy to implement
Disadvantages
Assumption: class conditional independence, therefore loss of
accuracy
Practically, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by Naïve
Bayes Classifier
How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
38
Chapter 8. Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
39
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
Each time a rule is learned, the tuples covered by the rule are removed
Repeat the process on the remaining tuples until a termination condition is met
(Figure: examples covered by Rule 1, Rule 2, and Rule 3 within the set of positive examples)
43
Rule Generation
To generate a rule
while (true)
  find the best predicate p
  if foil-gain(p) > threshold then add p to current rule
  else break
(Figure: rule growth; the rule specializes from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, separating positive from negative examples)
44
How to Learn-One-Rule?
Start with the most general rule possible: condition = empty
Adding new attributes by adopting a greedy depth-first strategy
Picks the one that most improves the rule quality
Rule-Quality measures: consider both coverage and accuracy
Foil-gain (in FOIL & RIPPER): assesses info_gain by extending condition
FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)
favors rules that have high accuracy and cover many positive tuples
Rule pruning based on an independent set of test tuples
FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}
Pos/neg are # of positive/negative tuples covered by R.
If FOIL_Prune is higher for the pruned version of R, prune R
45
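A hedged Python sketch of learn-one-rule using the FOIL_Gain formula above; the covers() helper, the candidate predicates, and the toy tuples are illustrative assumptions, not part of the slides.

```python
# Greedy rule growth driven by FOIL gain, as described on the slide.
from math import log2

def foil_gain(pos, neg, pos2, neg2):
    # gain from extending a rule covering (pos, neg) into one covering (pos2, neg2)
    if pos2 == 0 or pos == 0:
        return float("-inf")
    return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

def covers(rule, tuple_):
    return all(tuple_.get(a) == v for a, v in rule)

def coverage(rule, P, N):
    return (sum(covers(rule, t) for t in P), sum(covers(rule, t) for t in N))

def learn_one_rule(P, N, candidates, threshold=0.1):
    rule = []                                 # start with the most general rule
    while True:
        pos, neg = coverage(rule, P, N)
        best, best_gain = None, threshold
        for p in candidates:                  # greedily pick the best predicate
            pos2, neg2 = coverage(rule + [p], P, N)
            g = foil_gain(pos, neg, pos2, neg2)
            if g > best_gain:
                best, best_gain = p, g
        if best is None:
            return rule
        rule.append(best)

P = [{"student": "yes", "age": "<=30"}, {"student": "yes", "age": ">40"}]
N = [{"student": "no", "age": "<=30"}]
cands = [("student", "yes"), ("student", "no"), ("age", "<=30"), ("age", ">40")]
print(learn_one_rule(P, N, cands))            # e.g. [('student', 'yes')]
```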
Chapter 9. Classification: Advanced Methods
Bayesian Belief Networks
Classification by Backpropagation
Support Vector Machines
Classification by Using Frequent Patterns
46
Bayesian Belief Networks
Bayesian belief networks (also known as Bayesian networks,
probabilistic networks): allow class conditional independencies
between subsets of variables
A (directed acyclic) graphical model of causal relationships
Represents dependency among the variables
50
Classification by Backpropagation
51
Neural Network as a Classifier
Weakness
Long training time
Require a number of parameters typically best determined empirically,
e.g., the network topology or “structure.”
Poor interpretability: Difficult to interpret the symbolic meaning behind
the learned weights and of “hidden units” in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on an array of real-world data, e.g., hand-written letters
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of rules
from trained neural networks
52
A Multi-Layer Feed-Forward Neural Network
(Figure: the input vector X feeds the input layer; weights w_ij connect it to the hidden layer, whose weighted outputs feed the output layer, which produces the output vector)
Weight update: w_j^{(k+1)} = w_j^{(k)} + \lambda (y_i - \hat{y}_i^{(k)}) x_{ij}
53
How A Multi-Layer Neural Network Works
The inputs to the network correspond to the attributes measured for each
training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network's prediction
The network is feed-forward: None of the weights cycles back to an input
unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression:
Given enough hidden units and enough training samples, they can closely
approximate any function
54
Defining a Network Topology
Decide the network topology: Specify # of units in the input
layer, # of hidden layers (if > 1), # of units in each hidden layer,
and # of units in the output layer
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
For a discrete-valued attribute, one input unit per domain value may be used, each initialized to 0
Output: for classification with more than two classes, one output unit per class is used
Once a network has been trained and its accuracy is
unacceptable, repeat the training process with a different
network topology or a different set of initial weights
55
Backpropagation
Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value
For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target value
Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation”
Steps
Initialize weights to small random numbers, associated with biases
Propagate the inputs forward (by applying activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
56
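A minimal NumPy sketch of these steps for one hidden layer of sigmoid units trained on XOR; the layer sizes, learning rate, and epoch count are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0.0, 0.5, (2, 8)), np.zeros(8)   # initialize weights to small random numbers
W2, b2 = rng.normal(0.0, 0.5, (8, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for epoch in range(10000):
    # propagate the inputs forward
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backpropagate the error (squared-error gradient; sigmoid derivative is o*(1-o))
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # update weights and biases in the backwards direction
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())   # typically approaches [0, 1, 1, 0]
```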
Neuron: A Hidden/Output Layer Unit
(Figure: inputs x_0, x_1, …, x_n with weights w_0, w_1, …, w_n form a weighted sum, a bias is added, and an activation function f produces the output y)
For example: y = \mathrm{sign}\left( \sum_{i=0}^{n} w_i x_i + \mu_k \right), where \mu_k is the bias of the unit
58
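A tiny illustration of the unit's computation (the numbers are made up): a weighted sum plus bias passed through a hard-threshold activation.

```python
import numpy as np

x = np.array([0.3, -1.2, 0.5])      # inputs x_1..x_n
w = np.array([0.8,  0.1, -0.4])     # weights w_1..w_n
bias = 0.2
print(np.sign(w @ x + bias))        # hard-threshold (sign) activation, as in the example
```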
Chapter 9. Classification: Advanced Methods
Bayesian Belief Networks
Classification by Backpropagation
Support Vector Machines
Classification by Using Frequent Patterns
59
Classification: A Mathematical Mapping
x_i = (x_1, x_2, x_3, …), y_i = +1 or –1, e.g.,
x_1: # of occurrences of the word “homepage”
x_2: # of occurrences of the word “welcome”
Mathematically, x \in X = \Re^n, y \in Y = \{+1, -1\}
We want to derive a function f: X \to Y
Linear Classification
Binary classification problem
(Figure: a 2-D scatter plot of ‘x’ and ‘o’ points separated by a red line)
Data above the red line belongs to class ‘x’; data below it belongs to class ‘o’
Strengths, as compared to Bayesian methods in general:
Robust, works when training examples contain errors
Fast evaluation of the learned function (Bayesian networks are normally slow)
Criticism
Long training time
The learned weights are hard to interpret (Bayesian networks can be used easily for pattern discovery)
Not easy to incorporate domain knowledge (easy in Bayesian methods, in the form of priors on the data or distributions)
61
SVM—Support Vector Machines
A relatively new classification method for both linear and
nonlinear data
It uses a nonlinear mapping to transform the original training
data into a higher dimension
With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors)
62
SVM—History and Applications
Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis’ statistical learning theory in 1960s
Features: training can be slow but accuracy is high owing to
their ability to model complex nonlinear decision boundaries
(margin maximization)
Used for: classification and numeric prediction
Applications:
handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests
63
SVM—General Philosophy
64
SVM—Margins and Support Vectors
Let data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple associated with class label yi
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e.,
maximum marginal hyperplane (MMH)
66
SVM—Linearly Separable
A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints, solved by Quadratic Programming (QP) with Lagrangian multipliers
67
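A hedged sketch using scikit-learn (my choice, not the slides'): fit a linear SVM on toy 2-D data and read off the hyperplane W·X + b = 0, the support vectors, and the margin width 2/||W||.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C: (nearly) hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)
print("margin width = 2/||w|| =", 2 / np.linalg.norm(w))
```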
Why Is SVM Effective on High Dimensional Data?
SVM—Linearly Inseparable
70
Scaling SVM by Hierarchical Micro-Clustering
SVM is not scalable to the number of data objects in terms of training time
and memory usage
H. Yu, J. Yang, and J. Han, “Classifying Large Data Sets Using SVM with Hierarchical Clusters”, KDD'03
CB-SVM (Clustering-Based SVM)
Given limited amount of system resources (e.g., memory), maximize
the SVM performance in terms of accuracy and the training speed
Use micro-clustering to effectively reduce the number of points to be
considered
When deriving support vectors, de-cluster micro-clusters near the “candidate vector” to ensure high classification accuracy
71
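A simplified sketch of the micro-clustering idea only, not the actual CB-SVM/CF-tree algorithm: summarize each class with k-means centroids and train the SVM on the summaries (scikit-learn and the synthetic data are illustrative choices).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=20000, centers=2, cluster_std=2.0, random_state=0)

def micro_centroids(points, k=50):
    km = MiniBatchKMeans(n_clusters=k, random_state=0, n_init=3).fit(points)
    return km.cluster_centers_

# summarize each class by micro-cluster centroids, then train on the summaries
C_pos, C_neg = micro_centroids(X[y == 1]), micro_centroids(X[y == 0])
X_small = np.vstack([C_pos, C_neg])
y_small = np.array([1] * len(C_pos) + [0] * len(C_neg))

svm = SVC(kernel="linear").fit(X_small, y_small)
print("accuracy on the full data:", svm.score(X, y))
```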
CF-Tree: Hierarchical Micro-cluster
73
CB-SVM Algorithm: Outline
Construct two CF-trees from positive and negative data sets
independently
Need one scan of the data set
74
Accuracy and Scalability on Synthetic Dataset
77
Chapter 9. Classification: Advanced Methods
Bayesian Belief Networks
Classification by Backpropagation
Support Vector Machines
Classification by Using Frequent Patterns
78
Associative Classification
Associative classification: Major steps
Mine data to find strong associations between frequent patterns
(conjunctions of attribute-value pairs) and class labels
Association rules are generated in the form of
p_1 \wedge p_2 \wedge \cdots \wedge p_l \Rightarrow “A_{class} = C” (conf, sup)
Organize the rules to form a rule-based classifier
Why effective?
It explores highly confident associations among multiple attributes and may
overcome some constraints introduced by decision-tree induction, which
considers only one attribute at a time
Associative classification has been found to be often more accurate than
some traditional classification methods, such as C4.5
79
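An illustrative sketch of the mining step (the tiny data set, thresholds, and the restriction to rules with at most two predicates are my choices): enumerate attribute-value conjunctions and keep those whose support and confidence clear the thresholds.

```python
# Mine simple class association rules p1 ^ ... ^ pl => class.
from itertools import combinations
from collections import Counter

data = [
    {"age": "<=30", "student": "no",  "buys_computer": "no"},
    {"age": "<=30", "student": "yes", "buys_computer": "yes"},
    {"age": "31..40", "student": "no",  "buys_computer": "yes"},
    {"age": ">40", "student": "yes", "buys_computer": "yes"},
    {"age": ">40", "student": "no",  "buys_computer": "no"},
]
min_sup, min_conf = 0.2, 0.8

items = sorted({(a, r[a]) for r in data for a in r if a != "buys_computer"})
for size in (1, 2):
    for lhs in combinations(items, size):
        covered = [r for r in data if all(r.get(a) == v for a, v in lhs)]
        if not covered or len(covered) / len(data) < min_sup:
            continue
        label, count = Counter(r["buys_computer"] for r in covered).most_common(1)[0]
        conf = count / len(covered)
        if conf >= min_conf:
            print(" ^ ".join(f"{a}={v}" for a, v in lhs),
                  f"=> buys_computer={label} (conf={conf:.2f}, sup={len(covered)/len(data):.2f})")
```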
Typical Associative Classification Methods
82
Empirical Results
(Figure: information gain vs. pattern support for InfoGain and its upper bound IG_UpperBnd; information gain on the y-axis ranges from 0 to 1, support on the x-axis from 0 to 700)
83
Feature Selection
Given a set of frequent patterns, both non-discriminative and
redundant patterns exist, which can cause overfitting
We want to single out the discriminative patterns and remove
redundant ones
The notion of Maximal Marginal Relevance (MMR) is borrowed from information retrieval:
A document has high marginal relevance if it is both relevant to the query and contains minimal marginal similarity to previously selected documents
84
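A hedged sketch of MMR-style pattern selection; the trade-off score lambda*relevance - (1-lambda)*max-similarity is the standard MMR form, while the relevance scores, Jaccard similarity, and toy patterns are illustrative stand-ins.

```python
def mmr_select(patterns, relevance, similarity, k, lam=0.7):
    selected = []
    candidates = list(patterns)
    while candidates and len(selected) < k:
        def score(p):
            # penalize similarity to already-selected (redundant) patterns
            redundancy = max((similarity(p, q) for q in selected), default=0.0)
            return lam * relevance(p) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# toy patterns as sets of items; relevance = a precomputed discriminative score,
# similarity = Jaccard overlap between patterns
patterns = [frozenset("ab"), frozenset("abc"), frozenset("cd"), frozenset("de")]
rel = {patterns[0]: 0.9, patterns[1]: 0.85, patterns[2]: 0.6, patterns[3]: 0.5}
jaccard = lambda p, q: len(p & q) / len(p | q)
print(mmr_select(patterns, rel.get, jaccard, k=2))   # picks a relevant but diverse pair
```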
Experimental Results
85
Scalability Tests
86
DDPMine: Branch-and-Bound Search
a: constant, a parent node; b: variable, a descendant node
(Figure: association between information gain and frequency)
87
DDPMine Efficiency: Runtime
(Figure: runtime comparison of PatClass, Harmony, and DDPMine; PatClass is the pattern classification algorithm from ICDE'07)
88