Unit 4 DM

Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy: Ensemble Methods
 Summary
Unit 3

Supervised vs. Unsupervised Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data by constructing a model from the training set and
the values (class labels) of a classifying attribute, then uses the
model to classify new data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts
unknown or missing values
 Typical applications
 Credit/loan approval:
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent
 Web page categorization: which category it is
Classification: A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
 The set of tuples used for model construction is the training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of each test sample is compared with the classified
result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 The test set must be independent of the training set (otherwise the
estimate is inflated by overfitting)
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
Testing Data → Classifier
Unseen Data: (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
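
The two steps can be sketched in a few lines of Python. This is only an illustration: the function name `predict_tenured` and the tuple layout are assumptions made here, while the rule and the test tuples are the ones shown on the slides.

```python
# Rule from the model-construction slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known label)
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy rate: share of test tuples whose prediction matches the known label
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_set)
accuracy = correct / len(test_set)

print(accuracy)                         # 0.75 (Merlisa is misclassified)
print(predict_tenured("Professor", 4))  # yes  (the unseen tuple Jeff)
```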
Decision Tree Induction: An Example
 Training data set: Buys_computer
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

 Resulting tree:

age?
  <=30:   student?
            no:  no
            yes: yes
  31..40: yes
  >40:    credit rating?
            excellent: no
            fair:      yes
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected
attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)

 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf

 There are no samples left
Brief Review of Entropy
[Figure: entropy for m = 2 classes]
Attribute Selection Measure: Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify
a tuple in D:
$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
 Information needed (after using A to split D into v partitions) to
classify D:
$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
 Information gained by branching on attribute A:
$Gain(A) = Info(D) - Info_A(D)$
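
The three formulas above can be sketched in Python on the 14-tuple buys_computer training set from the decision tree example. The names `info`, `gain`, `DATA`, and `ATTRS` are choices made for this sketch, and the age interval is written as `31..40` in plain ASCII.

```python
from collections import Counter
from math import log2

# buys_computer training set: age income student credit_rating buys_computer
DATA = """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no"""
ATTRS = ["age", "income", "student", "credit_rating"]
rows = [line.split() for line in DATA.splitlines()]

def info(labels):
    """Info(D): entropy of the class distribution in `labels`."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(attr_index):
    """Gain(A) = Info(D) - Info_A(D) for the attribute at `attr_index`."""
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[-1])
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info([r[-1] for r in rows]) - info_a

gains = {name: gain(i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)   # attribute chosen for the root split
print(best)
```

Under these assumptions the attribute with the highest gain is `age`, which is why it sits at the root of the tree.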
Attribute Selection: Information Gain
 Class P: buys_computer = “yes”
 Class N: buys_computer = “no”
$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$
$\frac{5}{14} I(2,3)$ means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
$Gain(age) = Info(D) - Info_{age}(D) = 0.246$
Similarly,
$Gain(income) = 0.029$
$Gain(student) = 0.151$
$Gain(credit\_rating) = 0.048$
Computing Information Gain for Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
 $(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
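
A minimal sketch of this split-point search, assuming a two-column input of numeric values and class labels. The ages and labels below are hypothetical toy data, and `best_split_point` is a name introduced here for illustration.

```python
from math import log2

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def best_split_point(values, labels):
    """Return (split_point, expected_info) minimizing Info_A(D)."""
    pairs = sorted(zip(values, labels))
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue                      # identical values give no midpoint
        split = (a + b) / 2               # midpoint of adjacent values
        d1 = [y for x, y in pairs if x <= split]
        d2 = [y for x, y in pairs if x > split]
        expected = (len(d1) * info(d1) + len(d2) * info(d2)) / len(pairs)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best

# Hypothetical toy data, not taken from the slides
ages = [25, 32, 47, 51, 28, 38]
buys = ["no", "yes", "yes", "yes", "no", "yes"]
split, expected = best_split_point(ages, buys)
print(split)   # 30.0: both partitions are pure at this midpoint
```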
Gain Ratio for Attribute Selection (C4.5)
 Information gain measure is biased towards attributes with a
large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
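$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
 GainRatio(A) = Gain(A)/SplitInfo_A(D)
 Ex. income splits D into partitions of 4 (high), 6 (medium), and 4 (low) tuples, so SplitInfo_income(D) = 1.557
 gain_ratio(income) = 0.029/1.557 = 0.019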

 The attribute with the maximum gain ratio is selected as the
splitting attribute
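
The income example can be checked with a short sketch. The partition sizes 4, 6, and 4 are the counts of high, medium, and low income in the buys_computer training set; the function name `split_info` is an assumption of this sketch.

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) computed from the sizes |D_j| of the partitions."""
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes)

si_income = split_info([4, 6, 4])      # high, medium, low
gain_income = 0.029                    # Gain(income) from the information gain example
gain_ratio_income = gain_income / si_income

print(round(si_income, 3))             # 1.557
print(round(gain_ratio_income, 3))     # 0.019
```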
Gini Index (CART, IBM IntelligentMiner)
 If a data set D contains examples from n classes, the gini index, gini(D),
is defined as
$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
where p_j is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini
index gini_A(D) is defined as
$gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$
 Reduction in impurity:
$\Delta gini(A) = gini(D) - gini_A(D)$
 The attribute that provides the smallest gini_A(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)
Computation of Gini Index
 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split
values
 Can be modified for categorical attributes
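
A small sketch of the gini computation follows. The binary split of income into {low, medium} versus {high} is an illustrative choice made here; its class counts (7 yes / 3 no and 2 yes / 2 no) follow from the buys_computer training set.

```python
def gini(class_counts):
    """gini(D) = 1 - sum of squared relative class frequencies."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

def gini_split(d1_counts, d2_counts):
    """gini_A(D) for a binary split into D1 and D2 (class counts per subset)."""
    n1, n2 = sum(d1_counts), sum(d2_counts)
    return (n1 * gini(d1_counts) + n2 * gini(d2_counts)) / (n1 + n2)

g_d = gini([9, 5])                       # 9 "yes" tuples, 5 "no" tuples
g_income = gini_split([7, 3], [2, 2])    # {low, medium} vs. {high}
reduction = g_d - g_income               # reduction in impurity

print(round(g_d, 3))        # 0.459
print(round(g_income, 3))   # 0.443
```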
Comparing Attribute Selection Measures
 The three measures, in general, return good results but
 Information gain:

biased towards multivalued attributes
 Gain ratio:

tends to prefer unbalanced splits in which one partition is
much smaller than the others
 Gini index:

biased to multivalued attributes

has difficulty when # of classes is large

tends to favor tests that result in equal-sized partitions
and purity in both partitions
Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistic: has a close approximation to χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
 The best tree is the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results, none is significantly superior to the others
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to
noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting

 Prepruning: Halt tree construction early; do not split a node
if this would result in the goodness measure falling below a
threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree to
get a sequence of progressively pruned trees
 Use a set of data different from the training data to
decide which is the “best pruned tree”
Enhancements to Basic Decision Tree Induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication
Classification in Large Databases
 Classification—a classical problem extensively studied by
statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
 Why is decision tree induction popular?

relatively faster learning speed (than other classification
methods)

convertible to simple and easy to understand classification
rules

can use SQL queries for accessing databases

comparable classification accuracy with other methods
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)

Builds an AVC-list (attribute, value, class label)
Scalability Framework for RainForest
 Separates the scalability aspects from the criteria that

determine the quality of the tree
 Builds an AVC-list: AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X )
 Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
 AVC-group (of a node n )
 Set of AVC-sets of all predictor attributes at the node n
RainForest: Training Set and Its AVC Sets
Training Examples: the 14-tuple buys_computer data set (age, income, student, credit_rating, buys_computer) used in the decision tree induction example

AVC-set on age
age     buys_computer = yes  buys_computer = no
<=30    2                    3
31..40  4                    0
>40     3                    2

AVC-set on income
income  buys_computer = yes  buys_computer = no
high    2                    2
medium  4                    2
low     3                    1

AVC-set on student
student  buys_computer = yes  buys_computer = no
yes      6                    1
no       3                    4

AVC-set on credit_rating
credit_rating  buys_computer = yes  buys_computer = no
fair           6                    2
excellent      3                    3
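
The AVC-sets can be derived by a single counting pass over the training tuples. The sketch below assumes the buys_computer training set from the decision tree example (with the age interval written as `31..40`); `avc_set` and `avc_group` are names introduced here and are not part of the RainForest paper's code.

```python
from collections import Counter

# buys_computer training set: age income student credit_rating buys_computer
DATA = """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no"""
ATTRS = ["age", "income", "student", "credit_rating"]
rows = [line.split() for line in DATA.splitlines()]

def avc_set(attr_index):
    """Project onto (attribute value, class label) and aggregate the counts."""
    counts = Counter((r[attr_index], r[-1]) for r in rows)
    values = {r[attr_index] for r in rows}
    return {v: (counts[(v, "yes")], counts[(v, "no")]) for v in values}

# AVC-group of the root node: the AVC-sets of all predictor attributes
avc_group = {name: avc_set(i) for i, name in enumerate(ATTRS)}
print(avc_group["age"])
```

Each AVC-set is usually far smaller than the data itself, since its size depends on the number of distinct attribute values and classes rather than on the number of tuples.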
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
 Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
 Each subset is used to create a tree, resulting in several
trees
 These trees are examined and used to construct a new
tree T’
 It turns out that T’ is very close to the tree that would
be generated using the whole data set together
 Adv: requires only two scans of DB, an incremental alg.
Presentation of Classification Results
November 10, 2024 Data Mining: Concepts and Techniques 26

SGI/MineSet 3.0

Perception-Based Classification (PBC)
Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics
 Total probability theorem: $P(B) = \sum_{i=1}^{M} P(B|A_i) P(A_i)$
 Bayes’ theorem: $P(H|X) = \frac{P(X|H) P(H)}{P(X)} = P(X|H) \times P(H) / P(X)$
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), (i.e., posteriori probability): the
probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability

E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds

E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
Prediction Based on Bayes’ Theorem
 Given training data X, the posteriori probability of a hypothesis H,
P(H|X), follows Bayes’ theorem
$P(H|X) = \frac{P(X|H) P(H)}{P(X)} = P(X|H) \times P(H) / P(X)$
 Informally, this can be viewed as
posteriori = likelihood x prior/evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
Classification Is to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
 This can be derived from Bayes’ theorem
$P(C_i|X) = \frac{P(X|C_i) P(C_i)}{P(X)}$
 Since P(X) is constant for all classes, only
$P(C_i|X) \propto P(X|C_i) P(C_i)$
needs to be maximized
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$
 This greatly reduces the computation cost: only counts the
class distribution
 If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k
for A_k divided by |C_{i,D}| (# of tuples of C_i in D)
 If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ
$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
and P(x_k|C_i) is
$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
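
For a continuous-valued attribute, the Gaussian density can be evaluated directly. In the sketch below the mean of 38 and standard deviation of 12 for age within a class are hypothetical values chosen for illustration; they are not estimated from the slides' data.

```python
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    """g(x, mu, sigma): normal density with mean mu and std deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Hypothetical class statistics: age in class C_i has mean 38 and std 12
p_age_given_class = gaussian(35, 38.0, 12.0)
print(round(p_age_given_class, 4))
```

The resulting density is then multiplied into the product of the per-attribute terms exactly like a categorical frequency would be.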
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data to be classified:
X = (age <=30, income = medium, student = yes, credit_rating = fair)
Training data: the 14-tuple buys_computer data set (age, income, student, credit_rating, buys_computer) used in the decision tree induction example
Naïve Bayes Classifier: An Example
 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
 Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci): P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci): P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
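
The same computation can be reproduced by counting. This is a sketch assuming the 14-tuple buys_computer training set from the decision tree example (age interval written as `31..40`); `score` is a helper name introduced here and applies no smoothing.

```python
# buys_computer training set: age income student credit_rating buys_computer
DATA = """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no"""
rows = [line.split() for line in DATA.splitlines()]

# Tuple to classify: age <=30, income medium, student yes, credit_rating fair
x = ["<=30", "medium", "yes", "fair"]

def score(cls):
    """P(X|Ci) * P(Ci) under the class-conditional independence assumption."""
    in_class = [r for r in rows if r[-1] == cls]
    prior = len(in_class) / len(rows)
    likelihood = 1.0
    for k, value in enumerate(x):
        likelihood *= sum(r[k] == value for r in in_class) / len(in_class)
    return likelihood * prior

scores = {c: score(c) for c in ("yes", "no")}
prediction = max(scores, key=scores.get)
print(prediction, round(scores["yes"], 3), round(scores["no"], 3))
```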
Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$
 Ex. Suppose a dataset with 1000 tuples, income = low (0),
income = medium (990), and income = high (10)
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
 The “corrected” prob. estimates are close to their
“uncorrected” counterparts
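
The correction can be written as a small helper. The function name `laplace` is an assumption of this sketch, and exact fractions are used so the results can be compared with the slide directly.

```python
from fractions import Fraction

def laplace(counts):
    """Add 1 to every value count; the denominator grows by the number of values."""
    total = sum(counts.values()) + len(counts)
    return {value: Fraction(c + 1, total) for value, c in counts.items()}

probs = laplace({"low": 0, "medium": 990, "high": 10})
print(probs["low"], probs["medium"], probs["high"])   # 1/1003 991/1003 11/1003
```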
Naïve Bayes Classifier: Comments
 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence, therefore loss of
accuracy
 Practically, dependencies exist among variables

E.g., hospital patients: Profile: age, family history, etc.;
Symptoms: fever, cough, etc.; Disease: lung cancer,
diabetes, etc.

Dependencies among these cannot be modeled by Naïve
Bayes Classifier
 How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy

 n_covers = # of tuples covered by R
 n_correct = # of tuples correctly classified by R
coverage(R) = n_covers / |D|   /* D: training data set */
accuracy(R) = n_correct / n_covers
 If more than one rule is triggered, conflict resolution is needed
 Size ordering: assign the highest priority to the triggering rule that has the
“toughest” requirement (i.e., with the most attribute tests)

 Class-based ordering: decreasing order of prevalence or misclassification
cost per class

 Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
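
Coverage and accuracy of the rule R above can be computed on the buys_computer training set. The sketch assumes that "youth" corresponds to the age value `<=30` in that data set; `covers` is a helper introduced here.

```python
# buys_computer training set: age income student credit_rating buys_computer
DATA = """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no"""
ATTRS = ("age", "income", "student", "credit_rating", "buys_computer")
D = [dict(zip(ATTRS, line.split())) for line in DATA.splitlines()]

def covers(t):
    """Antecedent of R: age = youth AND student = yes (youth assumed to be <=30)."""
    return t["age"] == "<=30" and t["student"] == "yes"

covered = [t for t in D if covers(t)]
n_covers = len(covered)
n_correct = sum(t["buys_computer"] == "yes" for t in covered)

coverage = n_covers / len(D)      # 2/14
accuracy = n_correct / n_covers   # 2/2
print(round(coverage, 3), accuracy)
```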

Rule Extraction from a Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
[Figure: the buys_computer decision tree: age? with branches <=30 (student?: no → no, yes → yes), 31..40 (yes), and >40 (credit rating?: excellent → no, fair → yes)]
 Rules are mutually exclusive and exhaustive
 Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
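
The five extracted rules can be encoded as data and checked for the "mutually exclusive and exhaustive" property. The dictionary encoding and the names `RULES`, `matching_rules`, and `classify` are choices made for this sketch.

```python
from itertools import product

# (antecedent as attribute tests, predicted buys_computer class)
RULES = [
    ({"age": "young", "student": "no"}, "no"),
    ({"age": "young", "student": "yes"}, "yes"),
    ({"age": "mid-age"}, "yes"),
    ({"age": "old", "credit_rating": "excellent"}, "no"),
    ({"age": "old", "credit_rating": "fair"}, "yes"),
]

def matching_rules(t):
    """All rules whose antecedent is satisfied by tuple t."""
    return [(cond, cls) for cond, cls in RULES
            if all(t.get(attr) == val for attr, val in cond.items())]

def classify(t):
    matches = matching_rules(t)
    if len(matches) != 1:
        raise ValueError("rules are not mutually exclusive and exhaustive")
    return matches[0][1]

# Every combination of attribute values triggers exactly one rule
for age, student, credit in product(("young", "mid-age", "old"),
                                    ("yes", "no"),
                                    ("fair", "excellent")):
    t = {"age": age, "student": student, "credit_rating": credit}
    assert len(matching_rules(t)) == 1

print(classify({"age": "old", "student": "no", "credit_rating": "fair"}))  # yes
```

Because exactly one rule fires for every tuple, no conflict resolution strategy is needed for rules extracted this way.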
Rule Induction: Sequential Covering Method
 Sequential covering algorithm: Extracts rules directly from training
data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the

quality of a rule returned is below a user-specified threshold
 Comp. w. decision-tree induction: learning a set of rules
simultaneously
Sequential Covering Algorithm
while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule
[Figure: positive examples, with the regions of examples covered by Rule 1, Rule 2, and Rule 3]
Rule Generation
 To generate a rule
while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to current rule
    else break
[Figure: positive and negative examples, with the rule region growing more specific: A3=1, then A3=1&&A1=2, then A3=1&&A1=2&&A8=5]
How to Learn-One-Rule?
 Start with the most general rule possible: condition = empty
 Adding new attributes by adopting a greedy depth-first strategy

Picks the one that most improves the rule quality
 Rule-Quality measures: consider both coverage and accuracy

Foil-gain (in FOIL & RIPPER): assesses info_gain by extending condition
$FOIL\_Gain = pos' \times \left(\log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg}\right)$
favors rules that have high accuracy and cover many positive tuples
 Rule pruning based on an independent set of test tuples
$FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$
pos/neg are # of positive/negative tuples covered by R.
If FOIL_Prune is higher for the pruned version of R, prune R
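
Both measures are simple enough to sketch directly. The counts in the example (a rule covering 9 positive and 5 negative tuples, refined to cover 6 positive and 1 negative) are hypothetical, and the functions assume that both the original and the extended rule cover at least one positive tuple.

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain of extending a rule covering (pos, neg) to one covering (pos_new, neg_new).

    Assumes pos > 0 and pos_new > 0, otherwise log2 is undefined.
    """
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) on an independent pruning set."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts, not taken from the slides
g = foil_gain(pos=9, neg=5, pos_new=6, neg_new=1)
print(round(g, 3))                      # 2.49
print(round(foil_prune(6, 1), 3))       # 0.714
```

A positive gain means the added condition raised the rule's accuracy while still covering enough positive tuples to matter.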
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
Bayesian Belief Networks
 Bayesian belief networks (also known as Bayesian networks,
probabilistic networks): allow class conditional independencies
between subsets of variables
 A (directed acyclic) graphical model of causal relationships
 Represents dependency among the variables
 Gives a specification of joint probability distribution
 Nodes: random variables

 Links: dependency
 X and Y are the parents of Z, and Y is the parent of P
 No dependency between Z and P
[Figure: directed acyclic graph with links X → Z, Y → Z, and Y → P]
 Has no loops/cycles
Bayesian Belief Network: An Example
[Figure: Bayesian belief network with nodes FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea]
CPT: Conditional Probability Table for variable LungCancer:

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of its parents
Derivation of the probability of a particular combination of values of X, from CPT:
$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i | Parents(Y_i))$
Training Bayesian Networks: Several Scenarios
 Scenario 1: Given both the network structure and all variables observable:
compute only the CPT entries
 Scenario 2: Network structure known, some variables hidden: gradient
descent (greedy hill-climbing) method, i.e., search for a solution along the
steepest descent of a criterion function
 Weights are initialized to random probability values
 At each iteration, it moves towards what appears to be the best solution
at the moment, w.o. backtracking

 Weights are updated at each iteration & converge to local optimum
 Scenario 3: Network structure unknown, all variables observable: search

through the model space to reconstruct network topology
 Scenario 4: Unknown structure, all hidden variables: No good algorithms
known for this purpose
 D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning
in Graphical Models, M. Jordan, ed. MIT Press, 1999.
Classification by Backpropagation
 Backpropagation: A neural network learning algorithm

 Started by psychologists and neurobiologists to develop and
test computational analogues of neurons
 A neural network: A set of connected input/output units
where each connection has a weight associated with it
 During the learning phase, the network learns by adjusting
the weights so as to be able to predict the correct class label
of the input tuples
 Also referred to as connectionist learning due to the
connections between units
Neural Network as a Classifier
 Weakness
 Long training time
 Require a number of parameters typically best determined empirically,
e.g., the network topology or “structure.”
 Poor interpretability: Difficult to interpret the symbolic meaning behind
the learned weights and of “hidden units” in the network
 Strength
 High tolerance to noisy data
 Ability to classify untrained patterns
 Well-suited for continuous-valued inputs and outputs
 Successful on an array of real-world data, e.g., hand-written letters
 Algorithms are inherently parallel
 Techniques have recently been developed for the extraction of rules
from trained neural networks
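A Multi-Layer Feed-Forward Neural Network
[Figure: the input vector X feeds the input layer; weights w_ij connect it to the hidden layer, followed by the output layer, which emits the output vector]
$w_j^{(k+1)} = w_j^{(k)} + \lambda \left(y_i - \hat{y}_i^{(k)}\right) x_{ij}$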
How A Multi-Layer Neural Network Works
 The inputs to the network correspond to the attributes measured for each
training tuple
 Inputs are fed simultaneously into the units making up the input layer
 They are then weighted and fed simultaneously to a hidden layer
 The number of hidden layers is arbitrary, although usually only one
 The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network's prediction
 The network is feed-forward: None of the weights cycles back to an input
unit or to an output unit of a previous layer
 From a statistical point of view, networks perform nonlinear regression:
Given enough hidden units and enough training samples, they can closely
approximate any function
Defining a Network Topology
 Decide the network topology: Specify # of units in the input
layer, # of hidden layers (if > 1), # of units in each hidden layer,
and # of units in the output layer
 Normalize the input values for each attribute measured in the
training tuples to [0.0, 1.0]
 One input unit per domain value, each initialized to 0
 Output: for classification with more than two classes, one
output unit per class is used
 Once a network has been trained and its accuracy is
unacceptable, repeat the training process with a different
network topology or a different set of initial weights
Backpropagation
 Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value
 For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target value
 Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation”
 Steps
 Initialize weights to small random numbers, associated with biases
 Propagate the inputs forward (by applying activation function)
 Backpropagate the error (by updating weights and biases)
 Terminating condition (when error is very small, etc.)
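Neuron: A Hidden/Output Layer Unit
[Figure: input vector x = (x0, x1, …, xn) and weight vector w = (w0, w1, …, wn) form a weighted sum, which is combined with the bias μ_k and passed through the activation function f to give the output y]
For example: $y = sign\left(\sum_{i=0}^{n} w_i x_i + \mu_k\right)$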
 An n-dimensional input vector x is mapped into variable y by means of the scalar
product and a nonlinear function mapping
 The inputs to the unit are the outputs from the previous layer. They are multiplied by
their corresponding weights to form a weighted sum, which is added to the bias
associated with the unit. Then a nonlinear activation function is applied to it.
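
A single unit as described above can be sketched as follows. The inputs, weights, and bias values are hypothetical, `neuron_output` is a name introduced here, and the bias is added to the weighted sum as the text states; the sigmoid is included only as one common alternative activation.

```python
from math import exp

def sign(v):
    return 1 if v >= 0 else -1

def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))

def neuron_output(x, w, bias, activation=sign):
    """Weighted sum of the inputs plus the unit's bias, then the activation."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return activation(weighted_sum + bias)

# Hypothetical inputs and weights, not taken from the slides
x = (1.0, 0.5)
w = (0.4, -0.2)
print(neuron_output(x, w, bias=-0.1))                      # 1
print(neuron_output(x, w, bias=-0.5))                      # -1
print(neuron_output(x, w, bias=-0.3, activation=sigmoid))  # close to 0.5
```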
Efficiency and Interpretability
 Efficiency of backpropagation: Each epoch (one iteration through the
training set) takes O(|D| * w), with |D| tuples and w weights, but # of
epochs can be exponential in n, the number of inputs, in the worst case
 For easier comprehension: Rule extraction by network pruning
 Simplify the network structure by removing weighted links that have the
least effect on the trained network
 Then perform link, unit, or activation value clustering
 The set of input and activation values are studied to derive rules
describing the relationship between the input and hidden unit layers
 Sensitivity analysis: assess the impact that a given input variable has on a
network output. The knowledge gained from this analysis can be
represented in rules
Classification: A Mathematical Mapping
 Classification: predicts categorical class labels

 E.g., Personal homepage classification
 $x_i = (x_1, x_2, x_3, \ldots)$, $y_i = +1$ or $-1$
 x1: # of word “homepage”
 x2: # of word “welcome”
 Mathematically, $x \in X = \mathbb{R}^n$, $y \in Y = \{+1, -1\}$
 We want to derive a function f: X → Y
 Linear Classification
 Binary Classification problem
 Data above the red line belongs to class ‘x’
 Data below red line belongs to class ‘o’
[Figure: scatter plot of ‘x’ and ‘o’ points separated by a red line]
 Examples: SVM, Perceptron, Probabilistic Classifiers

Discriminative Classifiers
 Advantages
 Prediction accuracy is generally high

As compared to Bayesian methods – in general
 Robust, works when training examples contain errors
 Fast evaluation of the learned target function

Bayesian networks are normally slow
 Criticism
 Long training time
 Difficult to understand the learned function (weights)

Bayesian networks can be used easily for pattern discovery
 Not easy to incorporate domain knowledge

Easy in the form of priors on the data or distributions
SVM: Support Vector Machines
 A relatively new classification method for both linear and
nonlinear data
 It uses a nonlinear mapping to transform the original training
data into a higher dimension
 With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
 With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
 SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors)
SVM: History and Applications
 Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis’ statistical learning theory in 1960s
 Features: training can be slow but accuracy is high owing to
their ability to model complex nonlinear decision boundaries
(margin maximization)
 Used for: classification and numeric prediction
 Applications:
 handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests
SVM: General Philosophy
[Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors are marked]
SVM: Margins and Support Vectors
SVM: When Data Is Linearly Separable
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training
tuples associated with the class labels yi
There are infinitely many lines (hyperplanes) separating the two classes, but
we want to find the best one (the one that minimizes classification
error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e.,
maximum marginal hyperplane (MMH)
SVM: Linearly Separable
 A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
 For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
 The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
 This becomes a constrained (convex) quadratic optimization
problem: quadratic objective function and linear constraints →
Quadratic Programming (QP) → Lagrangian multipliers
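
The margin constraints and the notion of a support vector can be illustrated without solving the QP. In the sketch below the hyperplane (w, b) and the four training tuples are hypothetical and chosen by hand; the margin width 2/||W|| is the standard distance between H1 and H2.

```python
from math import sqrt

def decision_value(w, b, x):
    """W . X + b for a single tuple x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def satisfies_margin(w, b, data):
    """True if every tuple lies on or outside its margin side: y (W . X + b) >= 1."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in data)

def support_vectors(w, b, data, tol=1e-9):
    """Tuples lying exactly on H1 or H2, i.e. y (W . X + b) = 1."""
    return [(x, y) for x, y in data
            if abs(y * decision_value(w, b, x) - 1) <= tol]

# Hypothetical hyperplane and data, not taken from the slides
w, b = (1.0, 1.0), -3.0
data = [((1.0, 1.0), -1), ((0.0, 1.0), -1), ((2.0, 2.0), +1), ((3.0, 3.0), +1)]

margin = 2 / sqrt(sum(wi * wi for wi in w))
print(satisfies_margin(w, b, data))   # True
print(support_vectors(w, b, data))    # the tuples (1, 1) and (2, 2)
print(round(margin, 3))               # 1.414
```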
67
Why Is SVM Effective on High Dimensional
Data?
 The complexity of the trained classifier is characterized by the number of
support vectors rather than the dimensionality of the data
 The support vectors are the essential or critical training
examples —they lie closest to the decision boundary (MMH)
 If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
 The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier,
which is independent of the data dimensionality
 Thus, an SVM with a small number of support vectors can have
good generalization, even when the dimensionality of the data is
high
68
SVM—Linearly Inseparable
 Transform the original input data into a higher
dimensional space
 Search for a linear separating hyperplane in the
new space
[Figure: linearly inseparable two-class data plotted on attributes A1 and A2]
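A minimal sketch of this idea, under made-up assumptions: the 1-D data set, the mapping x → (x, x²), and the hyperplane in the new space are all hand-picked for illustration (the hyperplane is not a maximum-margin solution). No single threshold on x separates the classes, but a linear boundary does after the mapping.

```python
def phi(x):
    """Nonlinear mapping from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)

# Hypothetical 1-D data: class -1 lies between two groups of class +1.
data = [(-3.0, +1), (-2.0, +1), (-0.5, -1), (0.0, -1), (0.5, -1), (2.0, +1), (3.0, +1)]

def separable_by_threshold_1d(samples):
    """True if a single threshold on x splits the two classes."""
    labels = [y for _, y in sorted(samples)]
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes <= 1

def classify(x, W=(0.0, 1.0), b=-2.0):
    """Linear classifier in the mapped space: sign of W . phi(x) + b."""
    z = phi(x)
    score = W[0] * z[0] + W[1] * z[1] + b
    return +1 if score > 0 else -1

linearly_separable_before = separable_by_threshold_1d(data)
all_correct_after = all(classify(x) == y for x, y in data)
```

The hyperplane x² − 2 = 0 in the mapped space corresponds to a nonlinear boundary (two thresholds) in the original space.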
69
SVM: Different Kernel
functions
 Instead of computing the dot product on the transformed
data, it is mathematically equivalent to apply a kernel function
K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
 Typical Kernel Functions
 SVM can also be used for classifying multiple (> 2)
classes and for regression analysis (with additional
parameters)
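To illustrate the kernel identity, the sketch below defines three commonly used kernels (polynomial, Gaussian radial basis function, and sigmoid) and checks numerically that the degree-2 polynomial kernel equals a dot product in an explicit 6-dimensional feature space. The function names, default parameter values, and test vectors are assumptions chosen for the example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, y, h=2):
    """Polynomial kernel of degree h: (x . y + 1)^h."""
    return (dot(x, y) + 1) ** h

def gaussian_rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """Sigmoid kernel: tanh(kappa * x . y - delta)."""
    return math.tanh(kappa * dot(x, y) - delta)

def phi_poly2(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D input."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = polynomial_kernel(x, y)             # computed in the original 2-D space
explicit_value = dot(phi_poly2(x), phi_poly2(y))   # computed in the 6-D mapped space
```

The kernel evaluates one 2-D dot product, while the explicit route builds two 6-D vectors first; that gap widens quickly with higher degree and dimensionality.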
70
Scaling SVM by Hierarchical Micro-
Clustering
 SVM does not scale well with the number of data objects in terms of training time
and memory usage
 H. Yu, J. Yang, and J. Han, “Classifying Large Data Sets Using SVM with Hierarchical Clusters”, KDD'03
 CB-SVM (Clustering-Based SVM)
 Given limited amount of system resources (e.g., memory), maximize
the SVM performance in terms of accuracy and the training speed
 Use micro-clustering to effectively reduce the number of points to be
considered
 When deriving support vectors, de-cluster micro-clusters near the “candidate
vectors” to ensure high classification accuracy
71
CF-Tree: Hierarchical Micro-
cluster
 Read the data set once, construct a statistical summary of
the data (i.e., hierarchical clusters) given a limited amount
of memory
 Micro-clustering: Hierarchical indexing structure
 provide finer samples closer to the boundary and coarser
samples farther from the boundary
72
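One way to picture the statistical summary kept per micro-cluster is the clustering feature used in BIRCH-style CF-trees: a triple (N, LS, SS) holding the point count, the linear sum, and the sum of squared norms. The sketch below assumes that definition; the class and method names are hypothetical. It shows that two summaries can be merged without revisiting the points, and that centroid and radius can be derived from the summary alone.

```python
import math
from dataclasses import dataclass

@dataclass
class ClusteringFeature:
    n: int        # number of points summarized
    ls: tuple     # per-dimension linear sum of the points
    ss: float     # sum of squared norms of the points

    @classmethod
    def from_point(cls, point):
        """Summary of a single data point."""
        return cls(1, tuple(point), sum(v * v for v in point))

    def merge(self, other):
        """Clustering features are additive, so merging needs no raw data."""
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return ClusteringFeature(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(v / self.n for v in self.ls)

    def radius(self):
        """Root-mean-square distance of the summarized points from the centroid."""
        c = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(variance, 0.0))

# Two hypothetical points merged into one micro-cluster.
cf = ClusteringFeature.from_point((0.0, 0.0)).merge(ClusteringFeature.from_point((2.0, 0.0)))
```

Because the summaries are additive, a parent entry in the tree can be maintained as the sum of its children in a single scan of the data.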
Selective Declustering: Ensure High Accuracy
 CF tree is a suitable base structure for selective declustering

 De-cluster only the cluster Ei such that
 Di – Ri < Ds, where Di is the distance from the boundary to the center point
of Ei and Ri is the radius of Ei
 That is, decluster only a cluster whose subclusters could be
support clusters of the boundary
 “Support cluster”: a cluster whose centroid is a support vector
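The declustering test Di – Ri < Ds can be sketched as follows. This is an illustration under assumptions: the boundary is taken to be a linear hyperplane W · X + b = 0, Ds is passed in as a given threshold (its definition is not spelled out here), and the function names and cluster values are hypothetical.

```python
import math

def distance_to_boundary(W, b, point):
    """Euclidean distance from a point to the hyperplane W . X + b = 0."""
    value = sum(w * x for w, x in zip(W, point)) + b
    return abs(value) / math.sqrt(sum(w * w for w in W))

def should_decluster(W, b, center, radius, Ds):
    """Apply the test Di - Ri < Ds to one cluster."""
    Di = distance_to_boundary(W, b, center)
    return Di - radius < Ds

# Hypothetical boundary x1 = 0 and threshold Ds = 1.0.
W, b, Ds = (1.0, 0.0), 0.0, 1.0

near = should_decluster(W, b, center=(1.5, 0.0), radius=1.0, Ds=Ds)  # 1.5 - 1.0 < 1.0
far = should_decluster(W, b, center=(5.0, 0.0), radius=1.0, Ds=Ds)   # 5.0 - 1.0 >= 1.0
```

Only the near cluster is expanded into its children; the far cluster stays summarized by its centroid, which keeps the training set small.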
73
CB-SVM Algorithm: Outline
 Construct two CF-trees from positive and negative data sets
independently
 Need one scan of the data set
 Train an SVM from the centroids of the root entries

 De-cluster the entries near the boundary into the next level
 The children entries de-clustered from the parent entries
are accumulated into the training set with the non-declustered
parent entries
 Train an SVM again from the centroids of the entries in the
training set
 Repeat until nothing is accumulated
74
Accuracy and Scalability on Synthetic
Dataset
 Experiments on large synthetic data sets show that CB-SVM achieves better
accuracy than random sampling approaches and is far more scalable than
the original SVM algorithm
75
SVM vs. Neural Network
 SVM  Neural Network

 Deterministic algorithm  Nondeterministic
 Nice generalization algorithm
properties
 Generalizes well but
doesn’t have strong
 Hard to learn – learned in
mathematical foundation
batch mode using
 Can easily be learned in
quadratic programming
incremental fashion
techniques
 To learn complex functions
 Using kernels can learn —use multilayer
very complex functions perceptron (nontrivial)
76
SVM Related Links
 SVM Website: http://www.kernel-machines.org/

 Representative implementations
 LIBSVM: an efficient SVM implementation supporting multi-class
classification, nu-SVM, and one-class SVM, with interfaces
for Java, Python, etc.
 SVM-light: simpler, but its performance is no better than
LIBSVM's; supports only binary classification and is available only in C
 SVM-torch: another recent implementation also written in C
77
Methods
78
Associative Classification
 Associative classification: Major steps

Mine data to find strong associations between frequent patterns
(conjunctions of attribute-value pairs) and class labels

Association rules are generated in the form of
p1 ∧ p2 ∧ … ∧ pl → “Aclass = C” (conf, sup)

Organize the rules to form a rule-based classifier
 Why effective?

It explores highly confident associations among multiple attributes and may
overcome some constraints introduced by decision-tree induction, which
considers only one attribute at a time

Associative classification has been found to be often more accurate than
some traditional classification methods, such as C4.5
79
Typical Associative Classification
Methods
 CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD’98)

 Mine possible association rules in the form of
Cond-set (a set of attribute-value pairs) → class label
 Build classifier: Organize rules according to decreasing precedence based
on confidence and then support
 CMAR (Classification based on Multiple Association Rules: Li, Han, Pei,
ICDM’01)
 Classification: Statistical analysis on multiple rules
 CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03)
 Generation of predictive rules (FOIL-like analysis), but covered tuples are
retained with reduced weight rather than removed
 Prediction using best k rules
 High efficiency, accuracy similar to CMAR
80
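The rule-ordering step of a CBA-style classifier can be sketched as below. This is a simplified illustration, not the published algorithm: the candidate rules are written by hand instead of being mined, the data set is made up, and the function names are hypothetical. Each rule "cond-set → class label" is scored by confidence and support, rules are sorted by decreasing confidence and then support, and a new tuple is labelled by the first rule whose cond-set it satisfies.

```python
def rule_stats(cond, label, data):
    """Return (confidence, support) of the rule cond -> label on labelled data."""
    covered = [y for x, y in data if cond.items() <= x.items()]
    correct = sum(1 for y in covered if y == label)
    confidence = correct / len(covered) if covered else 0.0
    support = correct / len(data)
    return confidence, support

def rank_rules(rules, data):
    """Order rules by decreasing confidence, then decreasing support."""
    return sorted(rules, key=lambda r: rule_stats(r[0], r[1], data), reverse=True)

def classify(x, ranked_rules, default):
    """Label x with the first matching rule, else fall back to a default class."""
    for cond, label in ranked_rules:
        if cond.items() <= x.items():
            return label
    return default

# Hypothetical training tuples: (attribute-value pairs, class label).
data = [
    ({"age": "youth", "student": "yes"}, "buys"),
    ({"age": "youth", "student": "yes"}, "buys"),
    ({"age": "youth", "student": "no"}, "no"),
    ({"age": "senior", "student": "no"}, "no"),
    ({"age": "senior", "student": "yes"}, "buys"),
    ({"age": "senior", "student": "no"}, "buys"),
]

# Hand-written candidate rules (a real system would mine these).
rules = [
    ({"age": "youth"}, "buys"),
    ({"age": "youth", "student": "no"}, "no"),
    ({"student": "yes"}, "buys"),
]

ranked = rank_rules(rules, data)
prediction = classify({"age": "youth", "student": "no"}, ranked, default="buys")
```

The rule {student = yes} → buys ranks first (confidence 1.0, support 0.5), ahead of the equally confident but less frequent {age = youth, student = no} → no.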
Frequent Pattern-Based Classification
 H. Cheng, X. Yan, J. Han, and C.-W. Hsu, “Discriminative Frequent Pattern Analysis for Effective Classification”, ICDE'07
 Accuracy issue

Increase the discriminative power

Increase the expressive power of the feature space
 Scalability issue

It is computationally infeasible to generate all feature
combinations and filter them with an information gain threshold

Efficient method (DDPMine: FPtree pruning): H. Cheng, X. Yan, J.
Han, and P. S. Yu, “Direct Discriminative Pattern Mining for Effective Classification”,
ICDE'08
81
Frequent Pattern vs. Single
Feature
The discriminative power of some frequent patterns is
higher than that of single features.
[Fig. 1. Information Gain vs. Pattern Length; panels: (a) Austral, (b) Cleve, (c) Sonar]
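A small made-up example shows how a pattern can be more discriminative than the single features it is built from. In the sketch below, the class is positive only when items a and b occur together; the data set and function names are assumptions for illustration. Information gain is computed for a alone, b alone, and the pattern {a, b}.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Information gain obtained by splitting the labels on a discrete feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Hypothetical transactions: presence (1) or absence (0) of items a and b.
a = [1, 1, 1, 1, 0, 0, 0, 0]
b = [1, 1, 0, 0, 1, 1, 0, 0]
labels = ["+", "+", "-", "-", "-", "-", "-", "-"]

# The pattern {a, b} is present only when both items occur together.
pattern_ab = [x & y for x, y in zip(a, b)]

gain_a = info_gain(a, labels)
gain_b = info_gain(b, labels)
gain_ab = info_gain(pattern_ab, labels)
```

In this data the pattern removes all class uncertainty, so its gain equals the entropy of the labels, while each single item recovers only part of it.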
82
Empirical Results
[Fig. 2. Information Gain vs. Pattern Frequency; panels: (a) Austral, (b) Breast, (c) Sonar. x-axis: Support; y-axis: Information Gain; curves: InfoGain, IG_UpperBnd]
83
Feature Selection
 Given a set of frequent patterns, both non-discriminative and
redundant patterns exist, which can cause overfitting
 We want to single out the discriminative patterns and remove
redundant ones
 The notion of Maximal Marginal Relevance (MMR) is borrowed
 A document has high marginal relevance if it is both relevant
to the query and contains minimal marginal similarity to
previously selected documents
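An MMR-style greedy selection over patterns might look like the sketch below. It is a generic illustration, not the selection algorithm from the cited papers: relevance scores are given as inputs, redundancy between two patterns is assumed to be the Jaccard overlap of the tuples they cover scaled by the smaller relevance, and all names and values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of covered tuple ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mmr_select(candidates, k):
    """Greedily pick up to k patterns, trading relevance against redundancy.

    candidates maps a pattern name to (relevance, set of covered tuple ids).
    """
    selected = []
    remaining = dict(candidates)

    def marginal_gain(name):
        relevance, coverage = remaining[name]
        redundancy = max(
            (jaccard(coverage, candidates[s][1]) * min(relevance, candidates[s][0])
             for s in selected),
            default=0.0,
        )
        return relevance - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=marginal_gain)
        selected.append(best)
        del remaining[best]
    return selected

# Hypothetical patterns: p2 covers exactly the same tuples as p1, so it is redundant.
candidates = {
    "p1": (0.80, {1, 2, 3, 4}),
    "p2": (0.75, {1, 2, 3, 4}),
    "p3": (0.50, {5, 6}),
}

chosen = mmr_select(candidates, k=2)
```

Although p2 is more relevant than p3 on its own, it adds nothing beyond p1, so the less relevant but non-overlapping p3 is chosen second.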
84
Experimental Results
85
Scalability Tests
86
DDPMine: Branch-and-Bound
Search
sup(child ) sup( parent )

sup(b) sup(a )
a: constant, a parent
node Association between
b: variable, a information gain and
descendent frequency
87
DDPMine Efficiency: Runtime
[Figure: runtime comparison of Harmony, PatClass, and DDPMine. PatClass: ICDE’07 pattern classification algorithm]
88
