Chapter 4. Classification and Prediction
Typical applications of classification include:
■ Medical diagnosis
■ Fraud detection
Classification—A Two-Step Process
■ Model construction: describing a set of predetermined classes
■ Each tuple/sample is assumed to belong to a predefined class (given by the class label attribute)
■ Model usage: classifying future or unknown objects
■ The known label of a test sample is compared with the classified result from the model
■ Accuracy rate is the percentage of test-set samples that are correctly classified by the model
■ The test set is independent of the training set, otherwise over-fitting will occur
■ If the accuracy is acceptable, use the model to classify new data
[Figure: the classification process. A classification algorithm builds a classifier from the training data; the classifier is then applied to testing data and to unseen data such as (Jeff, Professor, 4) to answer "Tenured?". A minimal code sketch of this two-step workflow follows.]

NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
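As an illustrative sketch of this two-step workflow (the encoding of RANK, the held-out samples, and the use of scikit-learn are assumptions made here, not part of the slides):

```python
# Sketch of the two-step classification process, assuming scikit-learn is installed.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Training data from the figure: (rank, years), rank encoded as
# 0 = Assistant Prof, 1 = Associate Prof, 2 = Professor
X_train = [[0, 2], [1, 7], [2, 5], [0, 7]]
y_train = ["no", "no", "yes", "yes"]          # TENURED labels

# Step 1: model construction from data with known class labels
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on an independent (hypothetical) test set,
# then classify unseen data such as (Jeff, Professor, 4)
X_test, y_test = [[2, 6], [0, 3]], ["yes", "no"]
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Tenured?", model.predict([[2, 4]])[0])
```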
Supervised vs. Unsupervised Learning
■ Supervised learning (classification)
■ Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
■ New data is classified based on the training set
■ Unsupervised learning (clustering)
■ The class labels of the training data are unknown
■ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
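A minimal contrast of the two settings (the toy data and the use of scikit-learn are assumptions for illustration):

```python
# Supervised: the labels y guide the fit.  Unsupervised: structure is found without labels.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1.0, 1.2], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9]]   # toy measurements
y = ["A", "A", "B", "B"]                               # class labels (supervised case only)

clf = LogisticRegression().fit(X, y)         # classification: uses the labels
print(clf.predict([[4.0, 4.0]]))             # predicted class for a new observation

km = KMeans(n_clusters=2, n_init=10).fit(X)  # clustering: never sees y
print(km.labels_)                            # two clusters discovered from X alone
```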
Issues: Data Preparation
■ Data cleaning
■ Preprocess data in order to reduce noise and handle
missing values
■ Relevance analysis (feature selection)
Remove the irrelevant or redundant
■
attributes
■ Speed
■
time to construct the model (training time)
■
time to use the model (classification/prediction time)
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability
■
understanding and insight provided by the
■ model
Other measures, e.g., goodness of rules, such as
decision tree size or compactness of classification rules
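For the data-cleaning step, one common way to handle missing values is simple imputation; a sketch (scikit-learn's SimpleImputer, with invented numbers):

```python
# Replace missing values with the per-column mean (one of several possible strategies).
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],   # missing age
              [40.0, np.nan]])     # missing income

imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))    # NaNs replaced by column means
```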
Decision Tree Induction: Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for "buys_computer"

[Figure: decision tree with "age" at the root; each leaf is labeled yes or no (buys_computer)]
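A tree similar to the one above can be induced with a library learner; a sketch (scikit-learn with one-hot encoding, chosen here for illustration; the induced tree may differ in detail from the slide):

```python
# Fit a decision tree to the buys_computer training set and print its structure.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("<=30", "high", "no", "fair", "no"),      ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),   (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),      (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),     (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),  (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_computer"])

X = pd.get_dummies(df.drop(columns="buys_computer"))   # categorical -> indicator columns
tree = DecisionTreeClassifier(criterion="entropy").fit(X, df["buys_computer"])
print(export_text(tree, feature_names=list(X.columns)))
```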
Algorithm for Decision Tree Induction
■ Basic algorithm (a greedy algorithm)
■ Tree is constructed in a top-down recursive divide-and-conquer manner
■ At start, all the training examples are at the root
■ Attributes are categorical (if continuous-valued, they are discretized in advance)
■ Examples are partitioned recursively based on selected attributes
■ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
■ Conditions for stopping partitioning
■ All samples for a given node belong to the same class
■ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
■ There are no samples left
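The greedy top-down procedure can be sketched directly; the following ID3-style recursion is an illustrative sketch (the function names and the dict-based tree representation are choices made here, not part of the slides), using the entropy/information-gain measure described in the next slides.

```python
# Greedy top-down decision-tree induction over categorical attributes (ID3-style sketch).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    # Stopping conditions: pure node, no remaining attributes, or no samples left.
    if len(set(labels)) <= 1 or not attrs or not rows:
        return Counter(labels).most_common(1)[0][0] if labels else None  # majority vote

    def gain(a):  # information gain of splitting on attribute a
        values = Counter(r[a] for r in rows)
        remainder = sum(
            (count / len(rows)) *
            entropy([l for r, l in zip(rows, labels) if r[a] == v])
            for v, count in values.items())
        return entropy(labels) - remainder

    best = max(attrs, key=gain)                      # heuristic attribute selection
    tree = {}
    for v in set(r[best] for r in rows):             # partition recursively
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        tree[(best, v)] = build_tree([r for r, _ in subset],
                                     [l for _, l in subset],
                                     [a for a in attrs if a != best])
    return tree

# Usage: rows are dicts such as {"age": "<=30", "income": "high", ...},
# labels the corresponding buys_computer values, attrs the attribute names.
```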
Tree construction: general algorithm
Two steps: recursively generate the tree (steps 1–4), then prune the tree (step 5).

[Figure: the collection [9+, 5-] split by two candidate attributes, A1 = humidity (normal / high) and A2 = wind (weak / strong)]
Example: Training data for concept "play-tennis"
From 14 examples of Play-Tennis, 9 are positive and 5 are negative (denoted [9+, 5-]).

Entropy([9+, 5-]) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940
Notice:
1. Entropy is 0 if all members of S belong to the same
class
2. Entropy is 1 if the collection contains an equal number
of positive and negative examples. If these numbers are
unequal, the entropy is between 0 and 1.
Information gain measures the expected reduction in entropy
We define a measure, called information gain, of the effectiveness of an attribute in classifying data. It is the expected reduction in entropy caused by partitioning the objects according to this attribute:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which A has value v.
Information gain measures the expected reduction in entropy
For the Wind attribute (weak: [6+, 2-], strong: [3+, 3-]):

Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048

Which attribute is the best classifier?

[Figure: comparison of the information gain obtained by splitting on Humidity and on Wind]
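The entropy and gain figures above are easy to verify in a few lines (pure Python; the per-value counts for Wind are taken from the play-tennis table):

```python
# Check Entropy([9+, 5-]) = 0.940 and Gain(S, Wind) = 0.048.
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

print(f"{entropy(9, 5):.3f}")                     # 0.940

# Wind = weak covers [6+, 2-], Wind = strong covers [3+, 3-]
gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(f"{gain_wind:.3f}")                         # 0.048
```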
[Omitted slides, summarized by their titles: the next step in growing the decision tree; attributes with many values; extracting classification rules from trees (such rules can use SQL queries for accessing databases); prediction (SVM).]
Naïve Bayes Classifier

P(Ci | X) = P(X | Ci) · P(Ci) / P(X)

■ Since P(X) is constant for all classes, only P(X | Ci) · P(Ci) needs to be maximized
Derivation of Naïve Bayes Classifier
■ A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

P(X | Ci) = Π_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

■ This greatly reduces the computation cost: only the class distribution needs to be counted
■ If Ak is categorical, P(xk | Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
■ If Ak is continuous-valued, P(xk | Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π)·σ)) · e^(−(x−μ)² / (2σ²))

and P(xk | Ci) = g(xk, μ_Ci, σ_Ci)
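For the continuous-valued case, the Gaussian density g(x, μ, σ) is straightforward to code; a sketch (the sample ages are invented purely for illustration):

```python
# Gaussian class-conditional likelihood P(xk | Ci) for a continuous attribute.
from math import exp, pi, sqrt
from statistics import mean, stdev

def gaussian(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

ages_in_ci = [25, 28, 31, 35, 38, 42, 45, 33, 36]   # hypothetical ages of class-Ci tuples
mu_ci, sigma_ci = mean(ages_in_ci), stdev(ages_in_ci)

print(gaussian(30, mu_ci, sigma_ci))                # estimated P(age = 30 | Ci)
```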
Naïve Bayesian Classifier: Training Dataset

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Naïve Bayesian Classifier: An Example
■ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
■ Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
■ X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | Ci) · P(Ci): P(X | buys_computer = “yes”) · P(buys_computer = “yes”) = 0.028
P(X | buys_computer = “no”) · P(buys_computer = “no”) = 0.007
Therefore, X belongs to class buys_computer = “yes”
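The whole calculation can be reproduced from the training table by counting; a plain-Python sketch (no libraries; the counts come straight from the table above):

```python
# Naive Bayes by counting: P(X | Ci) * P(Ci) for X = (<=30, medium, yes, fair).
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),      ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),   (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),      (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),     (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),  (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")

for label in ("yes", "no"):
    rows = [r for r in data if r[4] == label]
    prior = len(rows) / len(data)                          # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(X):                          # product of P(xk | Ci)
        likelihood *= sum(1 for r in rows if r[k] == value) / len(rows)
    print(label, round(prior * likelihood, 3))             # yes: 0.028, no: 0.007
```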
Naïve Bayesian Classifier: Comments
■ Advantages
■ Easy to implement
■ Good results obtained in most of the cases
■ Disadvantages
■ Assumption of class conditional independence, therefore loss of accuracy
■ Practically, dependencies exist among variables
■ E.g., in hospitals, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and disease (lung cancer, diabetes, etc.)
■ Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
■ How to deal with these dependencies?
■ Bayesian Belief Networks
Naive Bayesian Classifier Example
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
For the unseen sample X = (sunny, cool, high, false):
P(X | P) · P(P) = 9/14 · 2/9 · 3/9 · 3/9 · 6/9 = 0.01
P(X | N) · P(N) = 5/14 · 3/5 · 1/5 · 4/5 · 2/5 = 0.013
Since P(X | N) · P(N) is larger, X is classified as class N.
Conflict resolution (when more than one rule is triggered):
■ Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
■ Class-based ordering: decreasing order of prevalence or misclassification cost per class
■ Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
Rule Extraction from a Decision Tree

[Figure: the buys_computer decision tree, rooted at "age"]
Sequential covering algorithm (rule induction):
■ Each time a rule is learned, the tuples covered by the rule are removed
■ The process repeats on the remaining tuples until a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
■ Compared with decision-tree induction, which learns a set of rules simultaneously
How to Learn-One-Rule?
■ Start with the most general rule possible: condition = empty
■ Add new attributes by adopting a greedy depth-first strategy
■ Pick the attribute that most improves the rule quality
■ Rule-quality measures: consider both coverage and accuracy
■ FOIL gain (in FOIL & RIPPER): assesses the information gained by extending the condition:

FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) − log2( pos / (pos + neg) ) )

where pos (neg) and pos' (neg') are the numbers of positive (negative) tuples covered by the rule before and after adding the new condition.
It favors rules that have high accuracy and cover many positive tuples
■ Rule pruning is based on an independent set of test tuples:

FOIL_Prune(R) = (pos − neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by rule R; if FOIL_Prune is higher for the pruned version of R, prune R.
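Both measures are one-liners in code; a sketch (the counts in the example call are invented):

```python
# FOIL_Gain and FOIL_Prune as defined above: pos/neg are the positive/negative tuples
# covered by the rule before extending it, pos_new/neg_new after extending it.
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

# Hypothetical rule covering 80+/40- is extended and now covers 60+/10-
print(foil_gain(80, 40, 60, 10))   # information gained by the extension
print(foil_prune(60, 10))          # quality of the extended rule on the prune set
```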
Classification: A Mathematical Mapping
■ x1: # of occurrences of the word "homepage"
■ x2: # of occurrences of the word "welcome"
■ Mathematically: x ∈ X = ℝ^n, y ∈ Y = {+1, −1}
■ We want a function f: X → Y
Linear Classification

[Figure: a binary classification problem; points of class 'x' lie above a red line and points of class 'o' lie below it]

■ The data above the red line belongs to class 'x'
■ The data below the red line belongs to class 'o'
■ Examples: SVM, Perceptron, Probabilistic Classifiers
Discriminative Classifiers
■ Advantages
■ Prediction accuracy is generally high
■ As compared to Bayesian methods – in general
■ Robust, works when training examples contain errors
■ Fast evaluation of the learned target function
■ Bayesian networks are normally slow
■ Criticism
■ Long training time
■ Difficult to understand the learned function (weights)
■ Bayesian networks can be used easily for pattern discovery
■ Not easy to incorporate domain knowledge (which is easy for Bayesian methods, in the form of priors on the data or distributions)
Perceptron & Winnow
• Vector: x, w; scalar: x, y, w
• Decision boundary: f(x) = w·x + b = 0, i.e. w1x1 + w2x2 + b = 0 in the two-dimensional case
• f(xi) > 0 for yi = +1; f(xi) < 0 for yi = −1
• Perceptron: update w additively
• Winnow: update w multiplicatively

[Figure: points in the (x1, x2) plane separated by the line w·x + b = 0]
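A minimal perceptron training loop with additive updates (the toy data, learning rate, and epoch count are assumptions; Winnow would multiply the weights instead of adding to them):

```python
# Perceptron: for each misclassified point, add eta*y*x to w and eta*y to b.
def train_perceptron(points, labels, eta=1.0, epochs=20):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):        # y in {+1, -1}
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # wrong side of w.x + b = 0
                w[0] += eta * y * x1                   # additive update
                w[1] += eta * y * x2                   # (Winnow: multiplicative)
                b += eta * y
    return w, b

# Linearly separable toy data: class +1 ('x') above the line, class -1 ('o') below it
points = [(2.0, 3.0), (3.0, 3.0), (1.0, 0.0), (2.0, 0.5)]
labels = [+1, +1, -1, -1]
print(train_perceptron(points, labels))                # learned (w, b)
```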
Classification by
Backpropagation
■ Backpropagation: A neural network learning algorithm
■ Started by psychologists and neurobiologists to develop
and test computational analogues of neurons
■ A neural network: A set of connected input/output units
where each connection has a weight associated with it
■ During the learning phase, the network learns by
adjusting the weights so as to be able to predict
the correct class label of the input tuples
■ Also referred to as connectionist learning due to
the connections between units
Neural Network as a Classifier
■ Weakness
■ Long training time
■ Require a number of parameters typically best determined empirically, e.g., the network topology or "structure"
■ Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
■ Strength
■ High tolerance to noisy data
■ Ability to classify untrained patterns
■ Well-suited for continuous-valued inputs and outputs
■ Successful on a wide array of real-world data
■ Algorithms are inherently parallel
■ Techniques have recently been developed for the extraction of
rules from trained neural networks
A Neuron (= a perceptron)

[Figure: a perceptron. Inputs x0, x1, …, xn with weights w0, w1, …, wn feed a weighted sum; together with the bias −μk, the activation function f produces the output y]

For example: y = sign( Σ_{i=0}^{n} wi xi − μk )
(input vector x, weight vector w, weighted sum, activation function)
■ The n-dimensional input vector x is mapped into variable y by
means of the scalar product and a nonlinear function mapping
Neural Networks
What are they?
Based on early research aimed at representing the way the human brain works.
Neural networks are composed of many processing units called neurons.

Neural Networks are great, but..
Problem 1: The black box model!
Solution 1: Do we really need to know?
Solution 2: Rule extraction techniques
Neural Network Concepts
Neural computing
Artificial neural network (ANN)

[Figure: a biological neuron, showing dendrites, soma, axon, and synapses]

[Figure: ANN model. Inputs x1, …, xn with weights w1, …, wn are combined by a summation and a transfer function to produce outputs Y1, …, Yn]
Three-step learning process:
1. Compute temporary outputs
2. Compare outputs with desired targets
3. Adjust the weights and repeat the process

[Figure: flowchart: compute outputs; if the desired output is not achieved, adjust the weights and repeat; otherwise stop learning]
How a Network Learns
Learning parameters:
Learning rate
Momentum
Backpropagation Learning

[Figure: backpropagation for a single unit. Inputs x1, …, xn with weights w1, …, wn pass through the summation and transfer function; the error term α(Zi − Yi), computed from the desired output Zi and the actual output Yi, is used to adjust the weights]
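A tiny backpropagation sketch for a single sigmoid unit (the toy AND-like task, learning rate, and epoch count are all invented for illustration):

```python
# One sigmoid unit trained by backpropagation: compute output, compare with the
# desired target, adjust the weights, and repeat.
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train(samples, eta=0.5, epochs=2000):
    random.seed(0)
    w = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]
    b = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for (x1, x2), target in samples:
            y = sigmoid(w[0] * x1 + w[1] * x2 + b)     # 1. compute output
            delta = (target - y) * y * (1 - y)         # 2. error * sigmoid derivative
            w[0] += eta * delta * x1                   # 3. adjust weights and bias
            w[1] += eta * delta * x2
            b += eta * delta
    return w, b

samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # AND-like targets
w, b = train(samples)
print([round(sigmoid(w[0] * x1 + w[1] * x2 + b), 2) for (x1, x2), _ in samples])
```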
SVM—Support Vector Machines
■ Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with the class labels yi
■ There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
■ SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
SVM—Linearly Separable
■ A separating hyperplane can be written as W · X + b = 0, where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
■ For 2-D it can be written as w0 + w1x1 + w2x2 = 0
■ The hyperplanes defining the sides of the margin:
H1: w0 + w1x1 + w2x2 ≥ 1 for yi = +1, and
H2: w0 + w1x1 + w2x2 ≤ −1 for yi = −1
■ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
■ This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints ► Quadratic Programming ► Lagrangian multipliers
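A sketch of a linear SVM on separable toy points, showing the learned hyperplane and the support vectors (scikit-learn; the data and the large C value are choices made here):

```python
# Linear SVM: the hyperplane is w.x + b = 0; support_vectors_ holds the training
# tuples lying on the margin hyperplanes H1/H2.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]]
y = [-1, -1, -1, +1, +1, +1]

svm = SVC(kernel="linear", C=1e6).fit(X, y)        # very large C approximates a hard margin
print("w =", svm.coef_[0], "b =", svm.intercept_[0])
print("support vectors:\n", svm.support_vectors_)
```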
Why Is SVM Effective on High Dimensional Data?
■ The complexity of the trained classifier is characterized by the number of support vectors rather than the dimensionality of the data
■ The support vectors are the essential or critical training examples; they lie closest to the decision boundary (MMH)
■ If all other training examples are removed and the training is repeated, the same separating hyperplane would be found
■ The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
■ Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
SVM—Linearly Inseparable
■ Transform the original input data into a higher dimensional space
■ Search for a linear separating hyperplane in the new space

[Figure: data in the original (A1, A2) space that is not linearly separable]

■ SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
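One way such a transformation is realized in practice is through a kernel function; a sketch on synthetic, linearly inseparable data (scikit-learn; the RBF kernel and the make_circles dataset are choices made here, not part of the slides):

```python
# RBF-kernel SVM on data that no straight line separates in the original space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)    # implicit mapping into a higher-dimensional space

print("linear kernel, training accuracy:", linear.score(X, y))   # poor (~0.5)
print("rbf kernel, training accuracy:   ", rbf.score(X, y))      # close to 1.0
```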
Scaling SVM by Hierarchical Micro-Clustering
■ SVM is not scalable to the number of data objects in terms of training
time and memory usage
■ “Classifying Large Datasets Using SVMs with Hierarchical Clusters
Problem” by Hwanjo Yu, Jiong Yang, Jiawei Han, KDD’03
■ CB-SVM (Clustering-Based SVM)
■ Given limited amount of system resources (e.g., memory),
maximize the SVM performance in terms of accuracy and the
training speed
■ Use micro-clustering to effectively reduce the number of points to
be considered
■ When deriving the support vectors, de-cluster the micro-clusters near the candidate support vectors to ensure high classification accuracy
CB-SVM: Clustering-Based SVM
■ Training data sets may not even fit in memory
■ Read the data set once (minimizing disk access)
■ Construct a statistical summary of the data (i.e., hierarchical
clusters) given a limited amount of memory
■ The statistical summary maximizes the benefit of learning
SVM
■ The summary plays a role in indexing SVMs
■ Essence of Micro-clustering (Hierarchical indexing structure)
■ Use micro-cluster hierarchical indexing structure
■ Provide finer samples closer to the boundary and coarser samples farther from the boundary
■ Selective de-clustering to ensure high accuracy
CF-Tree: Hierarchical Micro-cluster
CB-SVM Algorithm: Outline
■ Construct two CF-trees from positive and negative data sets independently
■ Need one scan of the data set
■ Train an SVM from the centroids of the root entries
■ De-cluster the entries near the boundary into the next
level
■ The children entries de-clustered from the parent are accumulated into the training set, together with the non-declustered parent entries
■ Prediction