Chapter 4


Data Science in Business

Chapter 4 — Classification and Prediction

Dr. LE SONG THANH QUYNH


Ho Chi Minh City University of Technology

Chapter 4. Classification and Prediction

■ What is classification? What is prediction?
■ Issues regarding classification and prediction
■ Classification by decision tree induction
■ Bayesian classification
■ Rule-based classification
■ Classification by back propagation
■ Support Vector Machines (SVM)
■ Associative classification
■ Lazy learners (or learning from your neighbors)
■ Other classification methods
■ Prediction
■ Accuracy and error measures
■ Ensemble methods
■ Model selection
Classification vs. Prediction
■ Classification
  ■ predicts categorical class labels (discrete or nominal)
  ■ classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
■ Prediction
  ■ models continuous-valued functions, i.e., predicts unknown or missing values
■ Typical applications
  ■ Credit approval
  ■ Target marketing
  ■ Medical diagnosis
  ■ Fraud detection
Classification—A Two-Step Process
■ Model construction: describing a set of predetermined classes
  ■ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  ■ The set of tuples used for model construction is the training set
  ■ The model is represented as classification rules, decision trees, or mathematical formulae
■ Model usage: classifying future or unknown objects
  ■ Estimate the accuracy of the model
    ■ The known label of each test sample is compared with the classified result from the model
    ■ Accuracy rate is the percentage of test set samples that are correctly classified by the model
    ■ The test set is independent of the training set; otherwise over-fitting will occur
  ■ If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
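A minimal sketch of the two-step process with scikit-learn. The Iris data and the decision tree classifier are illustrative stand-ins for any training set and model; accuracy is estimated on a held-out test set, as described above.

```python
# Hedged sketch: Iris and DecisionTreeClassifier are placeholders for any data/model.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- compare the known test labels with the model's predictions
y_pred = model.predict(X_test)
print("accuracy rate:", accuracy_score(y_test, y_pred))
```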


Process (1): Model Construction

Training data is fed to a classification algorithm, which produces the classifier (model).

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model (example rule):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction

The classifier is applied to the testing data to estimate accuracy, and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Supervised vs. Unsupervised Learning
■ Supervised learning (classification)
  ■ Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  ■ New data is classified based on the training set
■ Unsupervised learning (clustering)
  ■ The class labels of the training data are unknown
  ■ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues: Data Preparation

■ Data cleaning
  ■ Preprocess data in order to reduce noise and handle missing values
■ Relevance analysis (feature selection)
  ■ Remove irrelevant or redundant attributes
■ Data transformation
  ■ Generalize and/or normalize data
Issues: Evaluating Classification Methods
■ Accuracy
  ■ classifier accuracy: predicting class labels
  ■ predictor accuracy: guessing values of predicted attributes
■ Speed
  ■ time to construct the model (training time)
  ■ time to use the model (classification/prediction time)
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability
  ■ understanding and insight provided by the model
■ Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Decision Tree Induction: Training
Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”

age?
  <=30    → student?
              no  → no
              yes → yes
  31...40 → yes
  >40     → credit_rating?
              excellent → no
              fair      → yes
Algorithm for Decision Tree Induction
■ Basic algorithm (a greedy algorithm)
  ■ Tree is constructed in a top-down recursive divide-and-conquer manner
  ■ At start, all the training examples are at the root
  ■ Attributes are categorical (if continuous-valued, they are discretized in advance)
  ■ Examples are partitioned recursively based on selected attributes
  ■ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
■ Conditions for stopping partitioning
  ■ All samples for a given node belong to the same class
  ■ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  ■ There are no samples left
Tree construction: general algorithm
Two phases: recursively grow the tree (steps 1-4), then prune the tree (step 5).

1. At each node, choose the “best” attribute by a given measure for attribute selection
2. Extend the tree by adding a new branch for each value of the attribute
3. Sort the training examples to the leaf nodes
4. If the examples in a node belong to one class, then stop; else repeat steps 1-4 for the leaf nodes
5. Prune the tree to avoid over-fitting
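A minimal sketch of the recursive grow phase (steps 1-4) above, assuming categorical attributes stored as dictionaries and a pluggable attribute-selection measure (information gain, defined on the following slides, can be passed in as `choose_attribute`). The function names are illustrative, not from any library.

```python
from collections import Counter

def grow_tree(rows, attributes, target, choose_attribute):
    """rows: list of dicts; attributes: candidate attribute names; target: class attribute."""
    labels = [row[target] for row in rows]
    # Stop: all examples in the node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> majority voting at the leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: choose the "best" attribute by the given measure
    best = choose_attribute(rows, attributes, target)
    node = {best: {}}
    # Steps 2-4: add one branch per observed value and recurse on each partition
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        remaining = [a for a in attributes if a != best]
        node[best][value] = grow_tree(subset, remaining, target, choose_attribute)
    return node
```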
Example: Training data for concept “play-tennis”

• A typical dataset in machine learning
• 14 objects belonging to two classes {Y, N} are observed on 4 properties:
  • Dom(Outlook) = {sunny, overcast, rain}
  • Dom(Temperature) = {hot, mild, cool}
  • Dom(Humidity) = {high, normal}
  • Dom(Wind) = {weak, strong}
A decision tree for playing tennis
A simple decision tree for playing tennis

This tree is much simpler, as “outlook” is selected at the root. How do we select a good attribute to split a decision node?
Which attribute is the best?

• The “playing-tennis” set S contains 9 positive objects (+) and 5 negative objects (-), denoted [9+, 5-]
• If attributes “humidity” and “wind” split S into sub-nodes with the proportions of positive and negative objects shown below, which attribute is better?

  A1 = humidity: [9+, 5-]  →  normal: [6+, 1-],  high: [3+, 4-]
  A2 = wind:     [9+, 5-]  →  weak: [6+, 2-],  strong: [3+, 3-]


Entropy

• Entropy characterizes the impurity (or purity) of an arbitrary collection of objects.
  - S is the collection of positive and negative objects
  - p₊ is the proportion of positive objects in S
  - p₋ is the proportion of negative objects in S
  - In the play-tennis example, these numbers are 14, 9/14 and 5/14, respectively

• Entropy is defined as

  $Entropy(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$

Entropy

(Figure: the entropy function relative to a Boolean classification, as the proportion p₊ of positive objects varies between 0 and 1.)

If the collection has c distinct groups of objects, then the entropy is defined by

  $Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$
Example: Training data for concept “play-tennis”

From the 14 examples of Play-Tennis, there are 9 positive and 5 negative objects (denoted [9+, 5-]).

  $Entropy([9+, 5-]) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) = 0.940$

Notice:
1. Entropy is 0 if all members of S belong to the same class.
2. Entropy is 1 if the collection contains an equal number of positive and negative examples. If these numbers are unequal, the entropy is between 0 and 1.
Information gain measures the expected reduction in entropy

We define a measure, called information gain, of the effectiveness of an attribute in classifying data. It is the expected reduction in entropy caused by partitioning the objects according to this attribute:

  $Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which A has value v.
Information gain measures the expected reduction in entropy

Values(Wind) = {Weak, Strong}, S = [9+, 5-]
S_weak, the sub-node with value “weak”, is [6+, 2-]
S_strong, the sub-node with value “strong”, is [3+, 3-]

  $Gain(S, Wind) = Entropy(S) - \sum_{v \in \{weak,\,strong\}} \frac{|S_v|}{|S|}\, Entropy(S_v)$
  $= Entropy(S) - \frac{8}{14} Entropy(S_{weak}) - \frac{6}{14} Entropy(S_{strong})$
  $= 0.940 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.0 = 0.048$
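A minimal sketch that verifies the Gain(S, Wind) calculation above with NumPy. The counts come straight from the slide: S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-].

```python
import numpy as np

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative objects."""
    probs = np.array([pos, neg]) / (pos + neg)
    probs = probs[probs > 0]            # 0 * log2(0) is taken as 0
    return -np.sum(probs * np.log2(probs))

S = (9, 5)
S_weak, S_strong = (6, 2), (3, 3)

gain_wind = entropy(*S) \
    - (sum(S_weak) / sum(S)) * entropy(*S_weak) \
    - (sum(S_strong) / sum(S)) * entropy(*S_strong)

print(round(entropy(*S), 3))   # 0.940
print(round(gain_wind, 3))     # 0.048
```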
Which attribute is the best classifier?

S: [9+, 5-], E = 0.940

Split on Humidity:                       Split on Wind:
  High   [3+, 4-]  E = 0.985               Weak   [6+, 2-]  E = 0.811
  Normal [6+, 1-]  E = 0.592               Strong [3+, 3-]  E = 1.00

Gain(S, Humidity) = .940 - (7/14)(.985) - (7/14)(.592) = .151
Gain(S, Wind)     = .940 - (8/14)(.811) - (6/14)(1.00) = .048
Information gain of all attributes

Gain(S, Outlook)     = 0.246
Gain(S, Humidity)    = 0.151
Gain(S, Wind)        = 0.048
Gain(S, Temperature) = 0.029
Next step in growing the decision tree
Attributes with many values

• If an attribute has many values (e.g., days of the month), ID3 will select it
• C4.5 uses GainRatio instead
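GainRatio itself is not spelled out on the slide; the sketch below assumes the standard C4.5 definition, GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo is the entropy of the partition sizes themselves and therefore penalizes attributes with many values.

```python
import numpy as np

def split_info(partition_sizes):
    """Entropy of the partition induced by an attribute (penalizes many values)."""
    sizes = np.array(partition_sizes, dtype=float)
    probs = sizes / sizes.sum()
    return -np.sum(probs * np.log2(probs))

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# Wind splits the 14 examples into subsets of size 8 and 6; Gain(S, Wind) = 0.048
print(round(gain_ratio(0.048, [8, 6]), 3))   # ≈ 0.049
```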
Measures for attribute selection
Overfitting and Tree
Pruning
■ Overfitting: An induced tree may overfit the training data
■ Too many branches, some may reflect anomalies due to noise or
outliers
■ Poor accuracy for unseen samples
■ Two approaches to avoid overfitting
■ Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold

Difficult to choose an appropriate threshold
■ Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees

Use a set of data different from the training data to decide
which is the “best pruned tree”
Enhancements to Basic Decision Tree
Induction
■ Allow for continuous-valued attributes
■ Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
■ Handle missing attribute values

Assign the most common value of the attribute

Assign probability to each of the possible values
■ Attribute construction

Create new attributes based on existing ones that are
sparsely represented
■ This reduces fragmentation, repetition, and
replication
Classification in Large Databases

■ Classification—a classical problem extensively studied by statisticians and machine learning researchers
■ Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
■ Why decision tree induction in data mining?
  ■ relatively fast learning speed (compared with other classification methods)
  ■ convertible to simple and easy-to-understand classification rules
  ■ can use SQL queries for accessing databases
  ■ comparable classification accuracy with other methods

Scalable Decision Tree Induction Methods

■ SLIQ (EDBT’96 — Mehta et al.)
  ■ Builds an index for each attribute; only the class list and the current attribute list reside in memory
■ SPRINT (VLDB’96 — J. Shafer et al.)
  ■ Constructs an attribute list data structure
■ PUBLIC (VLDB’98 — Rastogi & Shim)
  ■ Integrates tree splitting and tree pruning: stop growing the tree earlier
■ RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
  ■ Builds an AVC-list (attribute, value, class label)
■ BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
  ■ Uses bootstrapping to create several small samples
Assignment 4 – 09/04/2024
Chapter 4. Classification and Prediction

■ What is classification? What is prediction?
■ Issues regarding classification and prediction
■ Classification by decision tree induction
■ Bayesian classification
■ Rule-based classification
■ Classification by back propagation
■ Support Vector Machines (SVM)
■ Associative classification
■ Lazy learners (or learning from your neighbors)
■ Other classification methods
■ Prediction
■ Accuracy and error measures
■ Ensemble methods
■ Model selection
■ Summary
Bayesian Classification: Why?
■ A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
■ Foundation: Based on Bayes’ Theorem.
■ Performance: A simple Bayesian classifier, naïve
Bayesian classifier, has comparable performance with
decision tree and selected neural network classifiers
■ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with observed
data
■ Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
Bayesian Theorem: Basics

■ Let X be a data sample (“evidence”) whose class label is unknown
■ Let H be a hypothesis that X belongs to class C
■ Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
■ P(H) (prior probability): the initial probability
  ■ E.g., X will buy a computer, regardless of age, income, …
■ P(X): the probability that the sample data is observed
■ P(X|H) (posterior probability): the probability of observing the sample X, given that the hypothesis holds
  ■ E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Bayesian Theorem

■ Given training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem

  $P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$

■ Informally, this can be written as: posterior = likelihood × prior / evidence
■ Predict that X belongs to Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
■ Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
Towards Naïve Bayesian Classifier

■ Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn)
■ Suppose there are m classes C1, C2, …, Cm
■ Classification is to derive the maximum posterior, i.e., the maximal P(Ci|X)
■ This can be derived from Bayes’ theorem

  $P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$

■ Since P(X) is constant for all classes, only $P(X|C_i)\,P(C_i)$ needs to be maximized
Derivation of Naïve Bayes Classifier

■ A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

  $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$

■ This greatly reduces the computation cost: only counts the class distribution
■ If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
■ If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

  $g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

  and P(xk|Ci) is

  $P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
Naïve Bayesian Classifier: Training Dataset

Classes:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data sample:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayesian Classifier: An Example

■ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
         P(buys_computer = “no”)  = 5/14 = 0.357
■ Compute P(X|Ci) for each class:
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
■ X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X|Ci):       P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
                 P(X|buys_computer = “no”)  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
  P(X|Ci)*P(Ci): P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
                 P(X|buys_computer = “no”)  * P(buys_computer = “no”)  = 0.007
■ Therefore, X belongs to class (buys_computer = “yes”)
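A minimal sketch that reproduces the hand computation above on the 14-tuple buys_computer table by plain counting (no library); the helper name `naive_bayes_score` is illustrative.

```python
# Rows are (age, income, student, credit_rating, buys_computer), as in the table above.
data = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31...40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
    ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
X = ("<=30", "medium", "yes", "fair")           # the unseen sample from the slide

def naive_bayes_score(X, label):
    rows = [r for r in data if r[-1] == label]
    prior = len(rows) / len(data)               # P(Ci)
    likelihood = 1.0
    for j, value in enumerate(X):               # P(xk|Ci) by simple counting
        likelihood *= sum(1 for r in rows if r[j] == value) / len(rows)
    return prior * likelihood

scores = {c: naive_bayes_score(X, c) for c in ("yes", "no")}
print(scores)                                   # roughly {'yes': 0.028, 'no': 0.007}
print("predicted:", max(scores, key=scores.get))
```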


Avoiding the 0-Probability Problem

■ Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

  $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$

■ Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
■ Use the Laplacian correction (or Laplacian estimator)
  ■ Adding 1 to each case:
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  ■ The “corrected” probability estimates are close to their “uncorrected” counterparts
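A minimal sketch of the Laplacian correction from the slide: add 1 to each of the three income counts and renormalize (1000 tuples become 1003).

```python
counts = {"low": 0, "medium": 990, "high": 10}

k = len(counts)                              # number of distinct values (3)
total = sum(counts.values()) + k             # 1000 + 3 = 1003
smoothed = {v: (c + 1) / total for v, c in counts.items()}

print(smoothed)   # {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}
```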
Naïve Bayesian Classifier: Comments

■ Advantages
  ■ Easy to implement
  ■ Good results obtained in most of the cases
■ Disadvantages
  ■ Assumption: class conditional independence, therefore loss of accuracy
  ■ Practically, dependencies exist among variables
    ■ E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    ■ Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
■ How to deal with these dependencies?
  ■ Bayesian Belief Networks
Naive Bayesian Classifier Example

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

To classify a new sample X:
  outlook = sunny
  temperature = cool
  humidity = high
  windy = false
Naive Bayesian Classifier Example

The 9 tuples of class P:

Outlook   Temperature  Humidity  Windy  Class
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
overcast  cool         normal    true   P
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P

The 5 tuples of class N:

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
rain      cool         normal    true   N
sunny     mild         high      false  N
rain      mild         high      true   N
Naive Bayesian Classifier Example

Given the training set, we compute the probabilities:

Outlook       P    N        Humidity   P    N
sunny         2/9  3/5      high       3/9  4/5
overcast      4/9  0        normal     6/9  1/5
rain          3/9  2/5

Temperature   P    N        Windy      P    N
hot           2/9  2/5      true       3/9  3/5
mild          4/9  2/5      false      6/9  2/5
cool          3/9  1/5

We also have the prior probabilities:
  P(P) = 9/14
  P(N) = 5/14
Naive Bayesian Classifier Example

To classify a new sample X:
  outlook = sunny, temperature = cool, humidity = high, windy = false

Prob(P|X) = Prob(P)·Prob(sunny|P)·Prob(cool|P)·Prob(high|P)·Prob(false|P)
          = 9/14 · 2/9 · 3/9 · 3/9 · 6/9 = 0.01

Prob(N|X) = Prob(N)·Prob(sunny|N)·Prob(cool|N)·Prob(high|N)·Prob(false|N)
          = 5/14 · 3/5 · 1/5 · 4/5 · 2/5 = 0.013

Therefore X takes class label N.
Naive Bayesian Classifier Example
Second example X = <rain, hot, high, false>

P(X|p)·P(p) =

P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582

P(X|n)·P(n) =

P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286

Sample X is classified in class N (don’t play)


Bayesian Belief Networks

■ A Bayesian belief network allows a subset of the variables to be conditionally independent
■ A graphical model of causal relationships
  ■ Represents dependency among the variables
  ■ Gives a specification of the joint probability distribution

❑ Nodes: random variables
❑ Links: dependency
❑ Example: X and Y are the parents of Z, and Y is the parent of P
❑ No dependency between Z and P
❑ Has no loops or cycles
Bayesian Belief Network: An Example

(Figure: a network over the variables Family History, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea, in which Family History and Smoker are the parents of LungCancer.)

The conditional probability table (CPT) for the variable LungCancer:

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
  LC    0.8      0.5       0.7       0.1
  ~LC   0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of values of its parents.

Derivation of the probability of a particular combination of values x1, …, xn from the CPTs:

  $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))$
Training Bayesian Networks
■ Several scenarios:

Given both the network structure and all variables
observable: learn only the CPTs

Network structure known, some hidden variables:
gradient descent (greedy hill-climbing) method,
analogous to neural network learning

Network structure unknown, all variables observable:
search through the model space to reconstruct
network topology
■ Unknown structure, all hidden variables: No good

algorithms known for this purpose


■ Ref. D. Heckerman: Bayesian networks for data mining
Chapter 4. Classification and Prediction

■ What is classification? What is prediction?
■ Issues regarding classification and prediction
■ Classification by decision tree induction
■ Bayesian classification
■ Rule-based classification
■ Classification by back propagation
■ Support Vector Machines (SVM)
■ Associative classification
■ Lazy learners (or learning from your neighbors)
■ Other classification methods
■ Prediction
■ Accuracy and error measures
■ Ensemble methods
■ Model selection
■ Summary
Using IF-THEN Rules for Classification

■ Represent the knowledge in the form of IF-THEN rules
    R: IF age = youth AND student = yes THEN buys_computer = yes
  ■ Rule antecedent/precondition vs. rule consequent
■ Assessment of a rule: coverage and accuracy
  ■ n_covers = number of tuples covered by R
  ■ n_correct = number of tuples correctly classified by R
    coverage(R) = n_covers / |D|      /* D: training data set */
    accuracy(R) = n_correct / n_covers
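A minimal sketch of coverage(R) and accuracy(R) for the rule above, evaluated against the buys_computer training table (only the age, student, and class columns are kept). Treating "youth" as the "<=30" age group is an assumption made for illustration.

```python
data = [  # (age, student, buys_computer) projected from the 14-tuple table
    ("<=30","no","no"), ("<=30","no","no"), ("31...40","no","yes"), (">40","no","yes"),
    (">40","yes","yes"), (">40","yes","no"), ("31...40","yes","yes"), ("<=30","no","no"),
    ("<=30","yes","yes"), (">40","yes","yes"), ("<=30","yes","yes"), ("31...40","no","yes"),
    ("31...40","yes","yes"), (">40","no","no"),
]

def rule_R(row):
    age, student, label = row
    return age == "<=30" and student == "yes"       # IF age = youth AND student = yes

covered = [r for r in data if rule_R(r)]
correct = [r for r in covered if r[-1] == "yes"]    # THEN buys_computer = yes

print("coverage:", len(covered) / len(data))        # 2/14 ≈ 0.143
print("accuracy:", len(correct) / len(covered))     # 2/2 = 1.0
```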
■ If more than one rule is triggered, conflict resolution is needed
  ■ Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
  ■ Class-based ordering: decreasing order of prevalence or misclassification cost per class
  ■ Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
Rule Extraction from a Decision Tree

■ Rules are easier to understand than large trees
■ One rule is created for each path from the root to a leaf
■ Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
■ Rules are mutually exclusive and exhaustive
■ Example: rule extraction from our buys_computer decision tree (root: age?, with branches <=30, 31..40, >40)
    IF age = young AND student = no THEN buys_computer = no
    IF age = young AND student = yes THEN buys_computer = yes
    IF age = mid-age THEN buys_computer = yes
    IF age = old AND credit_rating = excellent THEN buys_computer = yes
    IF age = old AND credit_rating = fair THEN buys_computer = no
Rule Extraction from the Training Data

■ Sequential covering algorithm: extracts rules directly from training data
■ Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
■ Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
■ Steps:
  ■ Rules are learned one at a time
  ■ Each time a rule is learned, the tuples covered by the rule are removed
  ■ The process repeats on the remaining tuples until a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
■ Compare with decision-tree induction: learning a set of rules simultaneously
How to Learn One Rule?

■ Start with the most general rule possible: condition = empty
■ Add new attribute tests by adopting a greedy depth-first strategy
  ■ Pick the one that most improves the rule quality
■ Rule-quality measures: consider both coverage and accuracy
  ■ FOIL gain (in FOIL & RIPPER): assesses the information gained by extending the condition

    $FOIL\_Gain = pos' \times \left(\log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg}\right)$

    It favors rules that have high accuracy and cover many positive tuples
■ Rule pruning based on an independent set of test tuples

    $FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$

  ■ pos/neg are the numbers of positive/negative tuples covered by R
  ■ If FOIL_Prune is higher for the pruned version of R, prune R
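A minimal sketch of the FOIL_Gain and FOIL_Prune measures defined above, with pos/neg counts before and after extending a rule's condition; the counts used in the example call are illustrative.

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """pos/neg cover the original rule; pos_new/neg_new cover the extended rule."""
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

# Illustrative counts only: a rule covering [9+, 5-] extended to cover [6+, 1-]
print(foil_gain(9, 5, 6, 1))
print(foil_prune(6, 1))
```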
Chapter 4. Classification and Prediction

■ What is classification? What is prediction?
■ Issues regarding classification and prediction
■ Classification by decision tree induction
■ Bayesian classification
■ Rule-based classification
■ Classification by back propagation
■ Support Vector Machines (SVM)
■ Associative classification
■ Lazy learners (or learning from your neighbors)
■ Other classification methods
■ Prediction
■ Accuracy and error measures
■ Ensemble methods
■ Model selection
■ Summary
Classification: A Mathematical Mapping

■ Classification: predicts categorical class labels
■ E.g., personal homepage classification
  ■ xi = (x1, x2, x3, …), yi = +1 or -1
  ■ x1: # of occurrences of the word “homepage”
  ■ x2: # of occurrences of the word “welcome”
■ Mathematically
  ■ x ∈ X = ℝⁿ, y ∈ Y = {+1, -1}
  ■ We want a function f: X → Y
Linear Classification

(Figure: a binary classification problem, with points of class ‘x’ and class ‘o’ separated by a red line.)

■ The data above the red line belongs to class ‘x’
■ The data below the red line belongs to class ‘o’
■ Examples: SVM, Perceptron, Probabilistic Classifiers
Discriminative Classifiers
■ Advantages
  ■ prediction accuracy is generally high
    ■ as compared to Bayesian methods – in general
  ■ robust, works when training examples contain errors
  ■ fast evaluation of the learned target function
    ■ Bayesian networks are normally slow
■ Criticism
  ■ long training time
  ■ difficult to understand the learned function (weights)
    ■ Bayesian networks can be used easily for pattern discovery
  ■ not easy to incorporate domain knowledge
    ■ easy in the form of priors on the data or distributions
Perceptron & Winnow
• Vector: x, w
x2 •Scalar: x, y, w

Input: {(x1, y1),


…}
Output: classification function f(x)
f(xi) > 0 for yi = +1

f(x)i) =>
f(x wxyi+=b-1= 0
< 0 for
or w1x1+w2x2+b = 0
•Perceptron: update
W additively
•Winnow: update
W multiplicatively
x1
Perceptron & Winnow
Classification by
Backpropagation
■ Backpropagation: A neural network learning algorithm
■ Started by psychologists and neurobiologists to develop
and test computational analogues of neurons
■ A neural network: A set of connected input/output units
where each connection has a weight associated with it
■ During the learning phase, the network learns by
adjusting the weights so as to be able to predict
the correct class label of the input tuples
■ Also referred to as connectionist learning due to
the connections between units
Neural Network as a Classifier
■ Weakness
■ Long training time
■ Require a number of parameters typically best determined
empirically, e.g., the network topology or ``structure."
■ Poor interpretability: Difficult to interpret the symbolic meaning
behind the learned weights and of ``hidden units" in the network
■ Strength
■ High tolerance to noisy data
■ Ability to classify untrained patterns
■ Well-suited for continuous-valued inputs and outputs
■ Successful on a wide array of real-world data
■ Algorithms are inherently parallel
■ Techniques have recently been developed for the extraction of
rules from trained neural networks
A Neuron (= a Perceptron)

An n-dimensional input vector x = (x0, x1, …, xn) with weight vector w = (w0, w1, …, wn) is combined by a weighted sum, a bias -μk, and an activation function f to produce the output y:

  $y = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i - \mu_k\right)$

■ The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping
Neural Networks

What are they?
■ Based on early research aimed at representing the way the human brain works
■ Neural networks are composed of many processing units called neurons
■ Types (supervised versus unsupervised)
■ Training
Neural networks are great, but…

Problem 1: The black-box model!
  Solution 1: Do we really need to know?
  Solution 2: Rule extraction techniques

Problem 2: Long training times
  Solution 1: Get a faster PC with lots of RAM
  Solution 2: Use faster algorithms, for example Quickprop

Problem 3: Back propagation
  Solution: Evolutionary neural networks!
Neural Network Concepts

Neural networks (NN): a brain metaphor for information


processing

Neural computing
Artificial neural network (ANN)

Many uses for ANN for


pattern recognition, forecasting, prediction, and classification

Many application areas


finance, marketing, manufacturing, operations, information
systems, and so on
Biological Neural Networks

(Figure: two interconnected brain cells (neurons), showing dendrites, soma, axon, and synapses.)
Processing Information in ANN

(Figure: a single neuron (processing element, PE) with inputs x1…xn, weights w1…wn, a summation function, a transfer function f(S), and outputs Y.)

  $S = \sum_{i=1}^{n} X_i W_i, \qquad Y = f(S)$
Elements of ANN

Processing element (PE)


Network architecture
Hidden layers
Parallel processing
Network information processing
Inputs
Outputs
Connection weights
Summation function
Neural Network Architectures
Recurrent Neural Networks
A Supervised Learning Process

(Figure: an ANN model in a feedback loop; weights are adjusted and the process repeats until the desired output is achieved, then learning stops.)

Three-step process:
1. Compute temporary outputs
2. Compare outputs with desired targets
3. Adjust the weights and repeat the process
How a Network Learns

Example: single neuron that learns the inclusive OR


operation

Learning parameters:
 Learning rate

 Momentum
Backpropagation Learning

(Figure: backpropagation of error for a single neuron; the error a(Zi - Yi) between the desired output Zi and the actual output Yi is fed back to adjust the weights.)

  $S = \sum_{i=1}^{n} X_i W_i, \qquad Y_i = f(S)$


Backpropagation Learning

The learning algorithm procedure:


1. Initialize weights with random values and set other
network parameters
2. Read in the inputs and the desired outputs
3. Compute the actual output (by working forward
through the layers)
4. Compute the error (difference between the actual
and desired output)
5. Change the weights by working backward through
the hidden layers
6. Repeat steps 2-5 until weights stabilize
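A minimal sketch of the 6-step procedure above for a single sigmoid neuron learning the inclusive OR function (the example mentioned two slides earlier). The learning rate and epoch count are illustrative; NumPy only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
Z = np.array([0, 1, 1, 1], dtype=float)                       # desired outputs (OR)

# Step 1: initialize the weights (and bias) with random values
w, b, eta = rng.normal(size=2), 0.0, 0.5

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

for epoch in range(5000):
    # Steps 2-3: read the inputs and compute the actual outputs (forward pass)
    Y = sigmoid(X @ w + b)
    # Step 4: compute the error (difference between actual and desired output)
    err = Z - Y
    # Step 5: change the weights by working backward (gradient of the squared error)
    delta = err * Y * (1 - Y)
    w += eta * X.T @ delta
    b += eta * delta.sum()
    # Step 6: repeat steps 2-5 until the weights stabilize (fixed epochs here)

print(Y.round(2))   # approaches [0, 1, 1, 1]
```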
Development Process of an
ANN
Neural Network Architectures

■ The architecture of a neural network is driven by the task it is intended to address: classification, regression, clustering, general optimization, association, …
■ Most popular architecture: feedforward, multi-layered perceptron with the backpropagation learning algorithm, used for both classification and regression type problems
Other Popular ANN Paradigms
Self Organizing Maps (SOM)
Applications of SOM
Customer segmentation
Bibliographic classification
Image-browsing systems
Medical diagnosis
Interpretation of seismic activity
Speech recognition
Data compression
Environmental modeling, many more …
Application of ANN

Forecasting/Market Prediction: finance and banking

Manufacturing: quality control, fault diagnosis

Medicine: analysis of electrocardiogram data, RNA &


DNA sequencing, drug development without animal
testing

Control: process, robotics


Chapter 4. Classification and Prediction

■ What is classification? What is prediction?
■ Issues regarding classification and prediction
■ Classification by decision tree induction
■ Bayesian classification
■ Rule-based classification
■ Classification by back propagation
■ Support Vector Machines (SVM)
■ Associative classification
■ Lazy learners (or learning from your neighbors)
■ Other classification methods
■ Prediction
■ Accuracy and error measures
■ Ensemble methods
■ Model selection

SVM—Support Vector Machines
■ A new classification method for both linear and nonlinear
data
■ It uses a nonlinear mapping to transform the original
training data into a higher dimension
■ With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
■ With an appropriate nonlinear mapping to a sufficiently
high dimension, data from two classes can always be
separated by a hyperplane
■ SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by the
support vectors)
SVM—History and Applications
■ Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis’ statistical learning theory in 1960s
■ Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
■ Used both for classification and prediction
■ Applications:

handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
SVM—General Philosophy

(Figure: two possible separating hyperplanes, one with a small margin and one with a large margin; the support vectors are highlighted.)

SVM—Margins and Support Vectors

(Figure: the margin and support vectors of the maximum-margin hyperplane.)
SVM—When Data Is Linearly Separable

Let the data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples with associated class labels yi.

There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data). SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).
SVM—Linearly Separable

■ A separating hyperplane can be written as
    W · X + b = 0
  where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
■ For 2-D it can be written as
    w0 + w1 x1 + w2 x2 = 0
■ The hyperplanes defining the sides of the margin:
    H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
    H2: w0 + w1 x1 + w2 x2 ≤ -1 for yi = -1
■ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
■ This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints ► Quadratic Programming ► Lagrangian multipliers
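A minimal sketch of fitting a maximum-margin linear SVM with scikit-learn on a small, linearly separable toy set (the data values are illustrative; a large C approximates the hard-margin case).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2.5], [1.5, 2], [5, 5], [6, 5.5], [5.5, 6.5]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6)                 # (near) hard-margin linear SVM
clf.fit(X, y)

print("w =", clf.coef_, "b =", clf.intercept_)    # the hyperplane W·X + b = 0
print("support vectors:", clf.support_vectors_)   # tuples lying on H1/H2
print(clf.predict([[2, 2], [6, 6]]))
```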
Why Is SVM Effective on High
Dimensional Data?

The complexity of trained classifier is characterized by the # of support
vectors rather than the dimensionality of the data

The support vectors are the essential or critical training examples —
they lie closest to the decision boundary (MMH)

If all other training examples are removed and the training is repeated,
the same separating hyperplane would be found

The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier, which
is independent of the data dimensionality
■ Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high
SVM—Linearly Inseparable

(Figure: data plotted in the (A1, A2) plane that cannot be separated by a straight line.)

■ Transform the original input data into a higher-dimensional space
■ Search for a linear separating hyperplane in the new space
SVM—Kernel Functions

■ Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
■ Typical kernel functions include the polynomial, Gaussian radial basis function (RBF), and sigmoid kernels
■ SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
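A minimal sketch illustrating the kernel idea: a degree-2 polynomial kernel K(x, z) = (x·z)² gives the same value as the dot product of an explicit feature map Φ in the transformed space. The vectors and the particular kernel are illustrative.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(poly_kernel(x, z))          # computed directly in the original space
print(np.dot(phi(x), phi(z)))     # identical value via the transformed space
```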
Scaling SVM by Hierarchical Micro-
Clustering
■ SVM is not scalable to the number of data objects in terms of training
time and memory usage
■ “Classifying Large Datasets Using SVMs with Hierarchical Clusters
Problem” by Hwanjo Yu, Jiong Yang, Jiawei Han, KDD’03
■ CB-SVM (Clustering-Based SVM)
■ Given limited amount of system resources (e.g., memory),
maximize the SVM performance in terms of accuracy and the
training speed
■ Use micro-clustering to effectively reduce the number of points to
be considered
■ At deriving support vectors, de-cluster micro-clusters near
“candidate vector” to ensure high classification accuracy
CB-SVM: Clustering-Based SVM
■ Training data sets may not even fit in memory
■ Read the data set once (minimizing disk access)
■ Construct a statistical summary of the data (i.e., hierarchical
clusters) given a limited amount of memory
■ The statistical summary maximizes the benefit of learning
SVM
■ The summary plays a role in indexing SVMs
■ Essence of Micro-clustering (Hierarchical indexing structure)
■ Use micro-cluster hierarchical indexing structure

provide finer samples closer to the boundary and coarser
samples farther from the boundary
■ Selective de-clustering to ensure high accuracy
CF-Tree: Hierarchical Micro-cluster
CB-SVM Algorithm: Outline
■ Construct two CF-trees from positive and negative data
sets independently

Need one scan of the data set
■ Train an SVM from the centroids of the root entries
■ De-cluster the entries near the boundary into the next
level
■ The children entries de-clustered from the parent

entries are accumulated into the training set with the


non-declustered parent entries
■ Train an SVM again from the centroids of the entries in the
training set
■ Repeat until nothing is accumulated
Selective Declustering
■ CF tree is a suitable base structure for selective declustering
■ De-cluster only the cluster Ei such that
■ Di – Ri < Ds, where Di is the distance from the boundary to
the center point of Ei and Ri is the radius of Ei

Decluster only the cluster whose subclusters have
possibilities to be the support cluster of the boundary

“Support cluster”: The cluster whose centroid is a
support vector
Experiment on Synthetic Dataset
Experiment on a Large Data
Set
SVM vs. Neural Network

■ SVM
  ■ Relatively new concept
  ■ Deterministic algorithm
  ■ Nice generalization properties
  ■ Hard to learn – learned in batch mode using quadratic programming techniques
  ■ Using kernels can learn very complex functions
■ Neural Network
  ■ Relatively old
  ■ Nondeterministic algorithm
  ■ Generalizes well but doesn’t have a strong mathematical foundation
  ■ Can easily be learned in incremental fashion
  ■ To learn complex functions, use a multilayer perceptron (not that trivial)
What Is Prediction?
■ (Numerical) prediction is similar to classification
  ■ construct a model
  ■ use the model to predict a continuous or ordered value for a given input
■ Prediction is different from classification
  ■ Classification refers to predicting categorical class labels
  ■ Prediction models continuous-valued functions
■ Major method for prediction: regression
  ■ model the relationship between one or more independent or predictor variables and a dependent or response variable
■ Regression analysis
  ■ Linear and multiple regression
  ■ Non-linear regression
  ■ Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression
■ Linear regression: involves a response variable y and a single predictor variable x
    y = w0 + w1 x
  where w0 (y-intercept) and w1 (slope) are the regression coefficients
■ Method of least squares: estimates the best-fitting straight line

  $w_1 = \frac{\sum_{i=1}^{|D|}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|}(x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1\,\bar{x}$

■ Multiple linear regression: involves more than one predictor variable
  ■ Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
  ■ Ex. for 2-D data, we may have: y = w0 + w1 x1 + w2 x2
  ■ Solvable by an extension of the least squares method or by statistical software packages such as SAS and S-Plus
  ■ Many nonlinear functions can be transformed into the above
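A minimal sketch of the least-squares estimates w1 and w0 above on a small illustrative data set; the closed-form result is checked against NumPy's polyfit.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(w0, w1)
print(np.polyfit(x, y, deg=1))   # returns [w1, w0]
```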
Nonlinear Regression
■ Some nonlinear models can be modeled by a polynomial function
■ A polynomial regression model can be transformed into a linear regression model. For example,
    y = w0 + w1 x + w2 x² + w3 x³
  is convertible to linear form with the new variables x2 = x², x3 = x³:
    y = w0 + w1 x + w2 x2 + w3 x3
■ Other functions, such as the power function, can also be transformed to a linear model
■ Some models are intractably nonlinear (e.g., a sum of exponential terms)
  ■ it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae
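A minimal sketch of the transformation above: fit y = w0 + w1·x + w2·x² + w3·x³ by creating the new variables x2 = x² and x3 = x³ and running ordinary linear least squares. The data values are illustrative (a known, noise-free cubic).

```python
import numpy as np

x = np.linspace(-2, 2, 30)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.8 * x**3        # a known cubic, noise-free

# Design matrix with columns [1, x, x2, x3] -- the "new variables"
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

print(w)   # recovers approximately [1.0, 0.5, -2.0, 0.8]
```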
Other Regression-Based Models
■ Generalized linear model:
■ Foundation on which linear regression can be applied to modeling
categorical response variables
■ Variance of y is a function of the mean value of y, not a constant
■ Logistic regression: models the prob. of some event occurring as a
linear function of a set of predictor variables
■ Poisson regression: models the data that exhibit a Poisson
distribution
■ Log-linear models: (for categorical data)
■ Approximate discrete multidimensional prob. distributions
■ Also useful for data compression and smoothing
■ Regression trees and model trees
■ Trees to predict continuous values rather than class labels
Regression Trees and Model Trees
■ Regression tree: proposed in the CART system (Breiman et al. 1984)
  ■ CART: Classification And Regression Trees
  ■ Each leaf stores a continuous-valued prediction
  ■ It is the average value of the predicted attribute for the training tuples that reach the leaf
■ Model tree: proposed by Quinlan (1992)
  ■ Each leaf holds a regression model: a multivariate linear equation for the predicted attribute
  ■ A more general case than the regression tree
■ Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model
Predictive Modeling in Multidimensional Databases
■ Predictive modeling: predict data values or construct generalized linear models based on the database data
■ One can only predict value ranges or category distributions
■ Method outline:
  ■ Minimal generalization
  ■ Attribute relevance analysis
  ■ Generalized linear model construction
  ■ Prediction
■ Determine the major factors which influence the prediction
  ■ Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
■ Multi-level prediction: drill-down and roll-up analysis
