Chapter 4. Classification and Prediction
Typical applications of classification include:
■ Medical diagnosis
■ Fraud detection
Classification—A Two-Step Process
■ Model construction: describing a set of predetermined classes
■ Each tuple/sample is assumed to belong to a predefined class (given by the class label attribute)
■ Model usage: classifying future or unknown objects
■ The known label of a test sample is compared with the classified result from the model
■ Accuracy rate is the percentage of test-set samples that are correctly classified by the model
■ The test set is independent of the training set, otherwise over-fitting will occur
■ If the accuracy is acceptable, use the model to classify new data
[Figure: the classification process. A classification algorithm builds a classifier from the training data; the classifier is then applied to testing data and to unseen data such as (Jeff, Professor, 4) to answer "Tenured?". A minimal code sketch of this two-step workflow follows.]

NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
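As an illustrative sketch of this two-step workflow (the encoding of RANK, the held-out samples, and the use of scikit-learn are assumptions made here, not part of the slides):

```python
# Sketch of the two-step classification process, assuming scikit-learn is installed.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Training data from the figure: (rank, years), rank encoded as
# 0 = Assistant Prof, 1 = Associate Prof, 2 = Professor
X_train = [[0, 2], [1, 7], [2, 5], [0, 7]]
y_train = ["no", "no", "yes", "yes"]          # TENURED labels

# Step 1: model construction from data with known class labels
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on an independent (hypothetical) test set,
# then classify unseen data such as (Jeff, Professor, 4)
X_test, y_test = [[2, 6], [0, 3]], ["yes", "no"]
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Tenured?", model.predict([[2, 4]])[0])
```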
Supervised vs. Unsupervised Learning
■ Supervised learning (classification)
■ Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
■ New data is classified based on the training set
■ Unsupervised learning (clustering)
■ The class labels of the training data are unknown
■ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
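A minimal contrast of the two settings (the toy data and the use of scikit-learn are assumptions for illustration):

```python
# Supervised: the labels y guide the fit.  Unsupervised: structure is found without labels.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1.0, 1.2], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9]]   # toy measurements
y = ["A", "A", "B", "B"]                               # class labels (supervised case only)

clf = LogisticRegression().fit(X, y)         # classification: uses the labels
print(clf.predict([[4.0, 4.0]]))             # predicted class for a new observation

km = KMeans(n_clusters=2, n_init=10).fit(X)  # clustering: never sees y
print(km.labels_)                            # two clusters discovered from X alone
```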
Issues: Data Preparation
■ Data cleaning
■ Preprocess data in order to reduce noise and handle
missing values
■ Relevance analysis (feature selection)
Remove the irrelevant or redundant
■
attributes
■ Speed
■
time to construct the model (training time)
■
time to use the model (classification/prediction time)
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability
■
understanding and insight provided by the
■ model
Other measures, e.g., goodness of rules, such as
decision tree size or compactness of classification rules
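For the data-cleaning step, one common way to handle missing values is simple imputation; a sketch (scikit-learn's SimpleImputer, with invented numbers):

```python
# Replace missing values with the per-column mean (one of several possible strategies).
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],   # missing age
              [40.0, np.nan]])     # missing income

imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))    # NaNs replaced by column means
```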
Decision Tree Induction: Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for "buys_computer"

[Figure: decision tree with "age" at the root; each leaf is labeled yes or no (buys_computer)]
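A tree similar to the one above can be induced with a library learner; a sketch (scikit-learn with one-hot encoding, chosen here for illustration; the induced tree may differ in detail from the slide):

```python
# Fit a decision tree to the buys_computer training set and print its structure.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("<=30", "high", "no", "fair", "no"),      ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),   (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),      (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),     (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),  (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_computer"])

X = pd.get_dummies(df.drop(columns="buys_computer"))   # categorical -> indicator columns
tree = DecisionTreeClassifier(criterion="entropy").fit(X, df["buys_computer"])
print(export_text(tree, feature_names=list(X.columns)))
```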
Algorithm for Decision Tree Induction
■ Basic algorithm (a greedy algorithm)
■ Tree is constructed in a top-down recursive divide-and-conquer manner
■ At start, all the training examples are at the root
■ Attributes are categorical (if continuous-valued, they are discretized in advance)
■ Examples are partitioned recursively based on selected attributes
■ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
■ Conditions for stopping partitioning
■ All samples for a given node belong to the same class
■ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
■ There are no samples left
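The greedy top-down procedure can be sketched directly; the following ID3-style recursion is an illustrative sketch (the function names and the dict-based tree representation are choices made here, not part of the slides), using the entropy/information-gain measure described in the next slides.

```python
# Greedy top-down decision-tree induction over categorical attributes (ID3-style sketch).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    # Stopping conditions: pure node, no remaining attributes, or no samples left.
    if len(set(labels)) <= 1 or not attrs or not rows:
        return Counter(labels).most_common(1)[0][0] if labels else None  # majority vote

    def gain(a):  # information gain of splitting on attribute a
        values = Counter(r[a] for r in rows)
        remainder = sum(
            (count / len(rows)) *
            entropy([l for r, l in zip(rows, labels) if r[a] == v])
            for v, count in values.items())
        return entropy(labels) - remainder

    best = max(attrs, key=gain)                      # heuristic attribute selection
    tree = {}
    for v in set(r[best] for r in rows):             # partition recursively
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        tree[(best, v)] = build_tree([r for r, _ in subset],
                                     [l for _, l in subset],
                                     [a for a in attrs if a != best])
    return tree

# Usage: rows are dicts such as {"age": "<=30", "income": "high", ...},
# labels the corresponding buys_computer values, attrs the attribute names.
```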
Tree construction: general algorithm
Two steps: recursively generate the tree (steps 1–4), then prune the tree (step 5).

[Figure: the collection [9+, 5-] split by two candidate attributes, A1 = humidity (normal / high) and A2 = wind (weak / strong)]
Example: Training data for concept "play-tennis"
From 14 examples of Play-Tennis, 9 are positive and 5 are negative (denoted [9+, 5-]).

Entropy([9+, 5-]) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940
Notice:
1. Entropy is 0 if all members of S belong to the same
class
2. Entropy is 1 if the collection contains an equal number
of positive and negative examples. If these numbers are
unequal, the entropy is between 0 and 1.
Information gain measures the expected reduction in entropy
We define a measure, called information gain, of the effectiveness of an attribute in classifying data. It is the expected reduction in entropy caused by partitioning the objects according to this attribute:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which A has value v.
Information gain measures the expected reduction in entropy
For the Wind attribute (weak: [6+, 2-], strong: [3+, 3-]):

Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048

Which attribute is the best classifier?

[Figure: comparison of the information gain obtained by splitting on Humidity and on Wind]
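The entropy and gain figures above are easy to verify in a few lines (pure Python; the per-value counts for Wind are taken from the play-tennis table):

```python
# Check Entropy([9+, 5-]) = 0.940 and Gain(S, Wind) = 0.048.
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

print(f"{entropy(9, 5):.3f}")                     # 0.940

# Wind = weak covers [6+, 2-], Wind = strong covers [3+, 3-]
gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(f"{gain_wind:.3f}")                         # 0.048
```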
[Omitted slides, summarized by their titles: the next step in growing the decision tree; attributes with many values; extracting classification rules from trees (such rules can use SQL queries for accessing databases); prediction (SVM).]
Naïve Bayes Classifier

P(Ci | X) = P(X | Ci) · P(Ci) / P(X)

■ Since P(X) is constant for all classes, only P(X | Ci) · P(Ci) needs to be maximized
Derivation of Naïve Bayes Classifier
■ A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

P(X | Ci) = Π_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

■ This greatly reduces the computation cost: only the class distribution needs to be counted
■ If Ak is categorical, P(xk | Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
■ If Ak is continuous-valued, P(xk | Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π)·σ)) · e^(−(x−μ)² / (2σ²))

and P(xk | Ci) = g(xk, μ_Ci, σ_Ci)
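For the continuous-valued case, the Gaussian density g(x, μ, σ) is straightforward to code; a sketch (the sample ages are invented purely for illustration):

```python
# Gaussian class-conditional likelihood P(xk | Ci) for a continuous attribute.
from math import exp, pi, sqrt
from statistics import mean, stdev

def gaussian(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

ages_in_ci = [25, 28, 31, 35, 38, 42, 45, 33, 36]   # hypothetical ages of class-Ci tuples
mu_ci, sigma_ci = mean(ages_in_ci), stdev(ages_in_ci)

print(gaussian(30, mu_ci, sigma_ci))                # estimated P(age = 30 | Ci)
```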
Naïve Bayesian Classifier: Training Dataset

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Naïve Bayesian Classifier: An Example
■ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
■ Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
■ X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | Ci) · P(Ci): P(X | buys_computer = “yes”) · P(buys_computer = “yes”) = 0.028
P(X | buys_computer = “no”) · P(buys_computer = “no”) = 0.007
Therefore, X belongs to class buys_computer = “yes”
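The whole calculation can be reproduced from the training table by counting; a plain-Python sketch (no libraries; the counts come straight from the table above):

```python
# Naive Bayes by counting: P(X | Ci) * P(Ci) for X = (<=30, medium, yes, fair).
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),      ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),   (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),      (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),     (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),  (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")

for label in ("yes", "no"):
    rows = [r for r in data if r[4] == label]
    prior = len(rows) / len(data)                          # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(X):                          # product of P(xk | Ci)
        likelihood *= sum(1 for r in rows if r[k] == value) / len(rows)
    print(label, round(prior * likelihood, 3))             # yes: 0.028, no: 0.007
```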
Naïve Bayesian Classifier: Comments
■ Advantages
■ Easy to implement
■ Good results obtained in most of the cases
■ Disadvantages
■ Assumption of class conditional independence, therefore loss of accuracy
■ Practically, dependencies exist among variables
■ E.g., in hospitals, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and disease (lung cancer, diabetes, etc.)
■ Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
■ How to deal with these dependencies?
■ Bayesian Belief Networks
Naive Bayesian Classifier Example
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
For the unseen sample X = (sunny, cool, high, false):
P(X | P) · P(P) = 9/14 · 2/9 · 3/9 · 3/9 · 6/9 = 0.01
P(X | N) · P(N) = 5/14 · 3/5 · 1/5 · 4/5 · 2/5 = 0.013
Since P(X | N) · P(N) is larger, X is classified as class N.
Conflict resolution (when more than one rule is triggered):
■ Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
■ Class-based ordering: decreasing order of prevalence or misclassification cost per class
■ Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
Rule Extraction from a Decision Tree

[Figure: the buys_computer decision tree, rooted at "age"]
Sequential covering algorithm (rule induction):
■ Each time a rule is learned, the tuples covered by the rule are removed
■ The process repeats on the remaining tuples until a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
■ Compared with decision-tree induction, which learns a set of rules simultaneously
How to Learn-One-Rule?
■ Start with the most general rule possible: condition = empty
■ Add new attributes by adopting a greedy depth-first strategy
■ Pick the attribute that most improves the rule quality
■ Rule-quality measures: consider both coverage and accuracy
■ FOIL gain (in FOIL & RIPPER): assesses the information gained by extending the condition:

FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) − log2( pos / (pos + neg) ) )

where pos (neg) and pos' (neg') are the numbers of positive (negative) tuples covered by the rule before and after adding the new condition.
It favors rules that have high accuracy and cover many positive tuples
■ Rule pruning is based on an independent set of test tuples:

FOIL_Prune(R) = (pos − neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by rule R; if FOIL_Prune is higher for the pruned version of R, prune R.
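Both measures are one-liners in code; a sketch (the counts in the example call are invented):

```python
# FOIL_Gain and FOIL_Prune as defined above: pos/neg are the positive/negative tuples
# covered by the rule before extending it, pos_new/neg_new after extending it.
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

# Hypothetical rule covering 80+/40- is extended and now covers 60+/10-
print(foil_gain(80, 40, 60, 10))   # information gained by the extension
print(foil_prune(60, 10))          # quality of the extended rule on the prune set
```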
Classification: A Mathematical Mapping
■ x1: # of occurrences of the word "homepage"
■ x2: # of occurrences of the word "welcome"
■ Mathematically: x ∈ X = ℝ^n, y ∈ Y = {+1, −1}
■ We want a function f: X → Y
Linear Classification

[Figure: a binary classification problem; points of class 'x' lie above a red line and points of class 'o' lie below it]

■ The data above the red line belongs to class 'x'
■ The data below the red line belongs to class 'o'
■ Examples: SVM, Perceptron, Probabilistic Classifiers
Discriminative Classifiers
■ Advantages
■ Prediction accuracy is generally high
■ As compared to Bayesian methods – in general
■ Robust, works when training examples contain errors
■ Fast evaluation of the learned target function
■ Bayesian networks are normally slow
■ Criticism
■ Long training time
■ Difficult to understand the learned function (weights)
■ Bayesian networks can be used easily for pattern discovery
■ Not easy to incorporate domain knowledge (which is easy for Bayesian methods, in the form of priors on the data or distributions)
Perceptron & Winnow
• Vector: x, w; scalar: x, y, w
• Decision boundary: f(x) = w·x + b = 0, i.e. w1x1 + w2x2 + b = 0 in the two-dimensional case
• f(xi) > 0 for yi = +1; f(xi) < 0 for yi = −1
• Perceptron: update w additively
• Winnow: update w multiplicatively

[Figure: points in the (x1, x2) plane separated by the line w·x + b = 0]
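A minimal perceptron training loop with additive updates (the toy data, learning rate, and epoch count are assumptions; Winnow would multiply the weights instead of adding to them):

```python
# Perceptron: for each misclassified point, add eta*y*x to w and eta*y to b.
def train_perceptron(points, labels, eta=1.0, epochs=20):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):        # y in {+1, -1}
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # wrong side of w.x + b = 0
                w[0] += eta * y * x1                   # additive update
                w[1] += eta * y * x2                   # (Winnow: multiplicative)
                b += eta * y
    return w, b

# Linearly separable toy data: class +1 ('x') above the line, class -1 ('o') below it
points = [(2.0, 3.0), (3.0, 3.0), (1.0, 0.0), (2.0, 0.5)]
labels = [+1, +1, -1, -1]
print(train_perceptron(points, labels))                # learned (w, b)
```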
Classification by
Backpropagation
■ Backpropagation: A neural network learning algorithm
■ Started by psychologists and neurobiologists to develop
and test computational analogues of neurons
■ A neural network: A set of connected input/output units
where each connection has a weight associated with it
■ During the learning phase, the network learns by
adjusting the weights so as to be able to predict
the correct class label of the input tuples
■ Also referred to as connectionist learning due to
the connections between units
Neural Network as a Classifier
■ Weakness
■ Long training time
■ Require a number of parameters typically best determined empirically, e.g., the network topology or "structure"
■ Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
■ Strength
■ High tolerance to noisy data
■ Ability to classify untrained patterns
■ Well-suited for continuous-valued inputs and outputs
■ Successful on a wide array of real-world data
■ Algorithms are inherently parallel
■ Techniques have recently been developed for the extraction of
rules from trained neural networks
A Neuron (= a perceptron)

[Figure: a perceptron. Inputs x0, x1, …, xn with weights w0, w1, …, wn feed a weighted sum; together with the bias −μk, the activation function f produces the output y]

For example: y = sign( Σ_{i=0}^{n} wi xi − μk )
(input vector x, weight vector w, weighted sum, activation function)
■ The n-dimensional input vector x is mapped into variable y by
means of the scalar product and a nonlinear function mapping
Neural Networks
What are they?
Based on early research aimed at representing the way the human brain works.
Neural networks are composed of many processing units called neurons.

Neural Networks are great, but..
Problem 1: The black box model!
Solution 1: Do we really need to know?
Solution 2: Rule extraction techniques
Neural Network Concepts
Neural computing
Artificial neural network (ANN)

[Figure: a biological neuron, showing dendrites, soma, axon, and synapses]

[Figure: ANN model. Inputs x1, …, xn with weights w1, …, wn are combined by a summation and a transfer function to produce outputs Y1, …, Yn]
Three-step learning process:
1. Compute temporary outputs
2. Compare outputs with desired targets
3. Adjust the weights and repeat the process

[Figure: flowchart: compute outputs; if the desired output is not achieved, adjust the weights and repeat; otherwise stop learning]
How a Network Learns
Learning parameters:
Learning rate
Momentum
Backpropagation Learning

[Figure: backpropagation for a single unit. Inputs x1, …, xn with weights w1, …, wn pass through the summation and transfer function; the error term α(Zi − Yi), computed from the desired output Zi and the actual output Yi, is used to adjust the weights]
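A tiny backpropagation sketch for a single sigmoid unit (the toy AND-like task, learning rate, and epoch count are all invented for illustration):

```python
# One sigmoid unit trained by backpropagation: compute output, compare with the
# desired target, adjust the weights, and repeat.
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train(samples, eta=0.5, epochs=2000):
    random.seed(0)
    w = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]
    b = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for (x1, x2), target in samples:
            y = sigmoid(w[0] * x1 + w[1] * x2 + b)     # 1. compute output
            delta = (target - y) * y * (1 - y)         # 2. error * sigmoid derivative
            w[0] += eta * delta * x1                   # 3. adjust weights and bias
            w[1] += eta * delta * x2
            b += eta * delta
    return w, b

samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # AND-like targets
w, b = train(samples)
print([round(sigmoid(w[0] * x1 + w[1] * x2 + b), 2) for (x1, x2), _ in samples])
```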
SVM—Support Vector Machines
■ Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with the class labels yi
■ There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
■ SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
SVM—Linearly Separable
■ A separating hyperplane can be written as W · X + b = 0, where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
■ For 2-D it can be written as w0 + w1x1 + w2x2 = 0
■ The hyperplanes defining the sides of the margin:
H1: w0 + w1x1 + w2x2 ≥ 1 for yi = +1, and
H2: w0 + w1x1 + w2x2 ≤ −1 for yi = −1
■ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
■ This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints ► Quadratic Programming ► Lagrangian multipliers
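A sketch of a linear SVM on separable toy points, showing the learned hyperplane and the support vectors (scikit-learn; the data and the large C value are choices made here):

```python
# Linear SVM: the hyperplane is w.x + b = 0; support_vectors_ holds the training
# tuples lying on the margin hyperplanes H1/H2.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]]
y = [-1, -1, -1, +1, +1, +1]

svm = SVC(kernel="linear", C=1e6).fit(X, y)        # very large C approximates a hard margin
print("w =", svm.coef_[0], "b =", svm.intercept_[0])
print("support vectors:\n", svm.support_vectors_)
```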
Why Is SVM Effective on High Dimensional Data?
■ The complexity of the trained classifier is characterized by the number of support vectors rather than the dimensionality of the data
■ The support vectors are the essential or critical training examples; they lie closest to the decision boundary (MMH)
■ If all other training examples are removed and the training is repeated, the same separating hyperplane would be found
■ The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
■ Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
SVM—Linearly Inseparable
■ Transform the original input data into a higher dimensional space
■ Search for a linear separating hyperplane in the new space

[Figure: data in the original (A1, A2) space that is not linearly separable]

■ SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
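One way such a transformation is realized in practice is through a kernel function; a sketch on synthetic, linearly inseparable data (scikit-learn; the RBF kernel and the make_circles dataset are choices made here, not part of the slides):

```python
# RBF-kernel SVM on data that no straight line separates in the original space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)    # implicit mapping into a higher-dimensional space

print("linear kernel, training accuracy:", linear.score(X, y))   # poor (~0.5)
print("rbf kernel, training accuracy:   ", rbf.score(X, y))      # close to 1.0
```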
Scaling SVM by Hierarchical Micro-Clustering
■ SVM is not scalable to the number of data objects in terms of training
time and memory usage
■ “Classifying Large Datasets Using SVMs with Hierarchical Clusters
Problem” by Hwanjo Yu, Jiong Yang, Jiawei Han, KDD’03
■ CB-SVM (Clustering-Based SVM)
■ Given limited amount of system resources (e.g., memory),
maximize the SVM performance in terms of accuracy and the
training speed
■ Use micro-clustering to effectively reduce the number of points to
be considered
■ When deriving the support vectors, de-cluster the micro-clusters near the candidate support vectors to ensure high classification accuracy
CB-SVM: Clustering-Based SVM
■ Training data sets may not even fit in memory
■ Read the data set once (minimizing disk access)
■ Construct a statistical summary of the data (i.e., hierarchical
clusters) given a limited amount of memory
■ The statistical summary maximizes the benefit of learning
SVM
■ The summary plays a role in indexing SVMs
■ Essence of Micro-clustering (Hierarchical indexing structure)
■ Use micro-cluster hierarchical indexing structure
■ Provide finer samples closer to the boundary and coarser samples farther from the boundary
■ Selective de-clustering to ensure high accuracy
CF-Tree: Hierarchical Micro-cluster
CB-SVM Algorithm: Outline
■ Construct two CF-trees from positive and negative data sets independently
■ Need one scan of the data set
■ Train an SVM from the centroids of the root entries
■ De-cluster the entries near the boundary into the next
level
■ The children entries de-clustered from the parent are accumulated into the training set, together with the non-declustered parent entries
■ Prediction