Introduction (15 Files Merged)

Uploaded by

harsh
Artificial Intelligence
Machine Learning: An Introduction

Ref & Acknowledgments
1. Dr Anoop Patel, NIT Kurukshetra
2. Dr Dinesh Kumar, ECE Dept, Delhi Technological University
3. M. Pradhan, U. Dinesh Kumar, “Machine Learning Using Python”, Wiley India, 2019.

Artificial Intelligence

• Artificial: designed or produced by human effort rather than arising naturally.
• Intelligence: ability to learn and apply knowledge for problem solving.

Intelligence of a machine is basically:
• The ability to solve the problem
• Ability to act rationally
• Ability to behave like a human being for a specific purpose

How to create a Learner

Steps:
1. Choose the training experience*/data
2. Choose the target function (that is to be learned)
3. Choose how to represent the target function**
4. Choose the learning algorithm to infer the target function

* can be expressed as features
** class of function

Analytics/AI/ML/DL

• Analytics: a collection of techniques (such as artificial intelligence, machine learning and deep learning) and tools used for creating value from data.
• Artificial Intelligence (AI): Algorithms and systems that exhibit human-like intelligence.
• Machine Learning (ML): Subset of AI that can learn to perform a task with extracted data and/or models.
• Deep Learning (DL): Subset of machine learning that imitates the functioning of the human brain to solve problems.

Machine Learning

• Programmed Solution: Data + Program → Computer → Output
• Machine Learning Solution: Data + Output → Computer → Program
• ML explores algorithms
• Learn/build model from data

ML in a Nutshell

• Tens of thousands of machine learning algorithms
• Hundreds new every year
• Every machine learning algorithm has three components:
  o Representation
  o Evaluation
  o Optimization

Representation

• Decision trees
• Sets of rules / Logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles etc.

Why Machine Learning?

• No human experts: industrial/manufacturing control, mass spectrometer analysis, drug design, astronomic discovery
• Black-box human expertise: face/handwriting/speech recognition, driving a car, flying a plane
• Rapidly changing phenomena: credit scoring, financial modeling, diagnosis, fraud detection
• Need for customization/personalization: personalized news reader, movie/book recommendation

Related Fields?

[Diagram: machine learning at the centre, surrounded by data mining, control theory, statistics, decision theory, information theory, cognitive science, databases, psychological models, evolutionary models, neuroscience]

Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.

Machine Learning: Definition

• Learning is the ability to improve one's behaviour based on experience.
• Build computer systems that automatically improve with experience.
• Machine learning explores algorithms that can
  o learn from data/build a model from data
  o use the model for prediction, decision making, or solving some task
• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance in tasks T, as measured by P, improves with experience E. [Tom Mitchell]

Evaluation

• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence etc.

Optimization

• Combinatorial optimization, e.g.: Greedy search
• Convex optimization, e.g.: Gradient descent
• Constrained optimization, e.g.: Linear programming
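As one illustration of how representation, evaluation and optimization fit together, the sketch below pairs a linear representation with squared error as the evaluation measure and gradient descent as the optimizer. The data, the learning rate and the number of steps are assumptions chosen for the example, not values from the slides.

```python
import numpy as np

def gradient_descent_line_fit(x, y, learning_rate=0.1, steps=2000):
    """Fit y ~ w0 + w1 * x by gradient descent on the mean squared error."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        residual = (w0 + w1 * x) - y            # prediction error per sample
        grad_w0 = 2.0 * residual.mean()         # d(MSE)/d(w0)
        grad_w1 = 2.0 * (residual * x).mean()   # d(MSE)/d(w1)
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1

# Hypothetical noise-free data generated from y = 1 + 2x.
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x
w0, w1 = gradient_descent_line_fit(x, y)
```

Because the squared error of a linear model is convex in the weights, a small enough learning rate lets the iterates approach the minimiser; other representations or evaluation measures would generally need a different optimizer.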

Progress of Machine Learning

1980 – The First Machine Learning Workshop was held at Carnegie-Mellon University in Pittsburgh.
1980 – Three consecutive issues of the International Journal of Policy Analysis and Information Systems were specially devoted to machine learning.
1981 – Hinton, Jordan, Sejnowski, Rumelhart, McClelland at UCSD: Back Propagation alg., PDP Book
1986 – The establishment of the Machine Learning journal.
1987 – The beginning of annual international conferences on machine learning (ICML). Snowbird ML conference
1988 – The beginning of regular workshops on computational learning theory (COLT).
1990’s – Explosive growth in the field of data mining, which involves the application of machine learning techniques.

Machine Learning: Definition

• Ability to learn without being explicitly programmed.
• Machine Learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible.
• ML is closely related to computational statistics.
• Mathematics is the base of the learning algorithms.

Steps for ML algorithm

• Identify the problem or opportunity for value creation.
• Identify sources of data and create a data lake.
• Pre-process the data for issues such as missing and incorrect data.
• Generate derived variables and transform the data if necessary.
• Divide the dataset into training and validation subsets.
• Build ML models and identify the best model(s) using model performance on the validation data.
• Implement Solution/Decision/Develop Product
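The split-and-validate steps of the workflow can be sketched with NumPy alone. Everything specific here is an assumption made for illustration: the synthetic quadratic dataset, the 70/30 split, and the use of polynomial degree as the set of candidate models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: a noisy quadratic relationship.
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)

# Divide the dataset into training and validation subsets (70/30).
indices = rng.permutation(x.size)
n_train = int(0.7 * x.size)
train_idx, val_idx = indices[:n_train], indices[n_train:]

def validation_mse(degree):
    """Fit a polynomial on the training subset and score it on the validation subset."""
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    predictions = np.polyval(coeffs, x[val_idx])
    return float(np.mean((predictions - y[val_idx]) ** 2))

# Build candidate models and keep the one with the lowest validation error.
candidate_degrees = [1, 2, 3]
scores = {degree: validation_mse(degree) for degree in candidate_degrees}
best_degree = min(scores, key=scores.get)
```

Scoring on data that the model did not see during fitting is what lets the validation error penalise both an underfitting straight line and an unnecessarily flexible model.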

Black-box Learner

[Diagram labels: Experiences/Data, Problem/Task, Background Knowledge/Bias, Answer/Performance]

Learner

[Diagram labels: Experiences/Data, Problem/Task, Background Knowledge/Bias, Learner, Models, Reasoner, Answer/Performance]

Types of Learning

[Figure]

Relationship between AI, ML, and DL

[Figure]

Supervised Learning

• Input data is labelled.

[Diagram: labelled pairs (x, y) = (Input1, output1), (Input2, output2), (Input3, output3), ..., (Input-n, output-n) → Learning Algorithm → Model; New Input x → Model → Output y]

Supervised Learning

• Supervised learning is where there are input variables (X) and an output variable (Y) and an algorithm is used to learn the mapping function from the input to the output: Y = f(X).
• The goal is to approximate the mapping function so well that when you have new input data (x) you can predict the output variable (Y) for that data.
• It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.

Unsupervised Learning

• Input is not labelled.
• Learning based on inherent property, e.g. clustering.

[Diagram: Input1, Input2, Input3, ..., Input-n → Learning Algorithm → Clusters]

Framework for Developing ML Models

[Figure]

Classification of ML algorithms

Supervised learning algorithms
• Require the knowledge of both the outcome variable (dependent variable) and the features (independent variables).
• The algorithm learns by defining a loss function.
• The loss function is usually a function of the difference between the predicted value and the actual value of the outcome variable.
• Examples: linear regression, logistic regression, discriminant analysis, etc.
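A minimal sketch of this supervised setting: labelled pairs go into a learning algorithm, which picks the mapping Y = f(X) that minimises a loss, and the resulting model is then applied to a new input. The four training pairs and the choice of a straight-line model with squared loss are assumptions for illustration only.

```python
import numpy as np

# Hypothetical labelled training data (Input-i, output-i pairs).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.1, 3.9, 6.2, 7.8])

def squared_loss(y_pred, y_true):
    """Loss as a function of the difference between predicted and actual values."""
    return float(np.mean((y_pred - y_true) ** 2))

# Learning algorithm: choose the weights of f(X) = w0 + w1 * X that minimise the loss.
design = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
weights, *_ = np.linalg.lstsq(design, y_train, rcond=None)

def model(x_new):
    """The learned model: maps a new input x to a predicted output y."""
    return weights[0] + weights[1] * x_new

training_loss = squared_loss(design @ weights, y_train)
prediction = model(5.0)
```

An unsupervised method would receive only `X_train`, with no `y_train` to compare against, so it could not define a loss of this kind.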

Supervised vs Unsupervised

[Figure panels: Supervised Learning Model, Unsupervised Learning Model]

Classification of ML algorithms (Cntd.)

Unsupervised learning algorithms
• Set of algorithms which do not have the knowledge of the outcome variable in the dataset.
• Algorithms must find the possible values of the outcome variable.
• Examples: clustering, principal component analysis, etc.

Reinforcement Learning

• Main idea: Learning with a Delayed Reward
• Uses dynamic programming and supervised learning
• Addresses problems that cannot be addressed by regular supervised methods
• e.g., Useful for Control Problems.
• Dynamic programming searches for optimal policies.

[Diagram: Agent and Environment; Action a_t; State s_t, s_{t+1}; Reward r_t, r_{t+1}]

Classification of ML algorithms (Cntd.)

Reinforcement learning algorithms
• Algorithms that have to take sequential actions (decisions) to maximize a cumulative reward.
• There is uncertainty around both the input and the output variables.
• Examples: Markov chain, Markov decision process, etc.

Evolutionary learning algorithms
• Algorithms that imitate natural evolution to solve a problem.
• Examples: genetic algorithms and ant colony optimization.

Applications of ML

• Tele robotics
• Robotic Navigation
• Risk Assessment
• Virtual Exercise Rehabilitation Assistant-VERA

Flow of Learning

[Figure]

Types of Learning

• Supervised
• Unsupervised
• Reinforcement

Why Machine Learning?

• It helps in understanding the association between key performance indicators (KPIs).
• Identifying the factors that have a significant impact on the KPIs for effective management.
• Knowledge of the relationship between KPIs and factors would provide the decision maker with appropriate action items.
• Used for identifying the factors that influence the KPIs, which helps in decision making and value creation.
• Organizations such as Amazon, Google, Capital One, IBM, Facebook are using ML algorithms to create new products and solutions.

Introduction to Machine Learning Class Notes

Huy Nguyen
PhD Student, Human-Computer Interaction Institute
Carnegie Mellon University

Contents

Preface
1 MLE and MAP
  1.1 MLE
  1.2 Bayesian learning and MAP
2 Nonparametric models: KNN and kernel regression
  2.1 Bayes decision rule
  2.2 Classification
  2.3 K-nearest neighbors
  2.4 Local Kernel Regression
3 Linear Regression
  3.1 Basic linear regression
  3.2 Multivariate and general linear regression
  3.3 Regularized least squares
    3.3.1 Definition
    3.3.2 Connection to MLE and MAP
4 Logistic Regression
  4.1 Definition
  4.2 Training logistic regression
5 Naive Bayes Classifier
  5.1 Gaussian Bayes
  5.2 Naive Bayes Classifier
  5.3 Text classification: bag of words model
  5.4 Generative vs Discriminative Classifier
6 Neural Networks and Deep Learning
  6.1 Definition
  6.2 Training a neural network
    6.2.1 Gradient descent
    6.2.2 Backpropagation
  6.3 Convolutional neural networks
    6.3.1 Convolutional Layer
    6.3.2 Pooling layer
    6.3.3 Fully Connected Layer
7 Support Vector Machine
  7.1 Introduction
  7.2 Primal form
    7.2.1 Linearly separable case
    7.2.2 Non linearly separable case
  7.3 Dual representation
    7.3.1 Linearly separable case
    7.3.2 Transformation of inputs
    7.3.3 Kernel tricks
    7.3.4 Non linearly separable case
  7.4 Other topics
    7.4.1 Why do SVMs work?
    7.4.2 Multi-class classification with SVM
8 Ensemble Methods and Boosting
  8.1 Introduction
  8.2 Mathematical details
    8.2.1 What $\alpha_t$ to choose for hypothesis $h_t$?
    8.2.2 Show that training error converges to 0
9 Principal Component Analysis
  9.1 Introduction
  9.2 PCA algorithms
  9.3 PCA applications
    9.3.1 Eigenfaces
    9.3.2 Image compression
  9.4 Shortcomings
10 Hidden Markov Model
  10.1 Introduction
  10.2 Inference in HMM
    10.2.1 What is $P(q_t = s_i)$?
    10.2.2 What is $P(q_t = s_i \mid o_1 o_2 \ldots o_t)$?
    10.2.3 What is $\arg\max_{q_1 q_2 \ldots q_t} P(q_1 q_2 \ldots q_t \mid o_1 o_2 \ldots o_t)$?
11 Reinforcement Learning
  11.1 Markov decision process
  11.2 Reinforcement learning - No action
    11.2.1 Supervised RL
    11.2.2 Certainty-Equivalence learning
    11.2.3 Temporal difference learning
  11.3 Reinforcement learning with action - Policy learning
12 Generalization and Model Selection
  12.1 True risk vs Empirical risk
  12.2 Model Selection

Preface

These are the class notes I took for CMU’s 10701: Introduction to Machine Learning in Fall 2018. The goal of this document is to serve as a quick review of key points from each topic covered in the course. A more comprehensive note collection for beginners is available at UPenn’s CIS520: Machine Learning.

In this document, each chapter typically covers one machine learning methodology and contains the following:

• Definition - definition of important concepts.
• Diving in the Math - mathematical proof for a statement / formula.
• Algorithm - the steps to perform a common routine / subroutine.

Intertwined with these components is transitional text (as I find it easier to review than bullet points), so the document as a whole ends up looking like a mini textbook. While there are already plenty of ML textbooks out there, I am still interested in writing up something that stays closest to the content taught by Professor Ziv Bar-Joseph and Pradeep Ravikumar. I would also like to take this opportunity to thank the two professors for their guidance.
Chapter 1

MLE and MAP

1.1 MLE

Definition 1: (Likelihood function and MLE)
Given $n$ data points $x_1, x_2, \ldots, x_n$ we can define the likelihood of the data given the model $\theta$ (usually a collection of parameters) as follows.
\[ \hat{P}(\text{dataset} \mid \theta) = \prod_{k=1}^{n} \hat{P}(x_k \mid \theta). \tag{1.1} \]
The maximum likelihood estimate (MLE) of $\theta$ is
\[ \hat{\theta}_{MLE} = \arg\max_{\theta} \hat{P}(\text{dataset} \mid \theta). \tag{1.2} \]

To determine the values for the parameters in $\theta$, we maximize the probability of generating the observed samples. For example, let $\theta$ be the model of a coin flip (so $\theta = \{P(\text{Head}) = q\}$), then the best assignment (MLE) for $\theta$ in this case is $\hat{\theta} = \left\{\hat{q} = \frac{\#\text{heads}}{\#\text{samples}}\right\}$.

Diving in the Math 1 - MLE for binary variable
For a binary random variable $A$ with $P(A = 1) = q$, we show that $\hat{q} = \frac{\#1}{\#\text{samples}}$.
Assume we observe $n$ samples $x_1, x_2, \ldots, x_n$ with $n_1$ heads and $n_2$ tails. Then, the likelihood function is
\[ P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = q^{n_1}(1 - q)^{n_2}. \]
We now find $\hat{q}$ that maximizes this likelihood function, i.e., $\hat{q} = \arg\max_{q} q^{n_1}(1 - q)^{n_2}$. To do so, we set the derivative to 0:
\[ 0 = \frac{\partial}{\partial q}\, q^{n_1}(1 - q)^{n_2} = n_1 q^{n_1 - 1}(1 - q)^{n_2} - q^{n_1} n_2 (1 - q)^{n_2 - 1}, \]
which is equivalent to
\[ q^{n_1 - 1}(1 - q)^{n_2 - 1}\bigl(n_1(1 - q) - q n_2\bigr) = 0, \]
which yields
\[ n_1(1 - q) - q n_2 = 0 \iff q = \frac{n_1}{n_1 + n_2}. \]

Diving in the Math 2 - Computing probabilities for KNN
Let $z$ be the new point we want to classify. Let $V$ be the volume of the $m$-dimensional ball $R$ around $z$ containing the $K$ nearest neighbors of $z$ (where $m$ is the number of features). Also assume that the distribution in $R$ is uniform.
Consider the probability $P$ that a data point chosen at random is in $R$. On one hand, because there are $K$ points in $R$ out of a total of $N$ points, $P = \frac{K}{N}$. On the other hand, let $P(x) = q$ be the density at a point $x \in R$ ($q$ is constant because $R$ has uniform distribution). Then $P = \int_{x \in R} P(x)\,dx = qV$. Hence we see that the marginal probability of $z$ is
\[ P(z) = q = \frac{P}{V} = \frac{K}{NV}. \]
Similarly, writing $N_i$ for the number of training points in class $i$ and $K_i$ for the number of those points that fall in $R$, the conditional probability of $z$ given class $i$ is
\[ P(z \mid y = i) = \frac{K_i}{N_i V}. \]
Finally, we compute the prior of class $i$:
\[ P(y = i) = \frac{N_i}{N}. \]
Using Bayes formula:
\[ P(y = i \mid z) = \frac{P(z \mid y = i)\,P(y = i)}{P(z)} = \frac{K_i}{K}. \]
Using the Bayes decision rule we will choose the class with the highest probability, which corresponds to the class with the highest $K_i$, the number of class-$i$ samples among the $K$ nearest neighbors.

2.4 Local Kernel Regression

Kernel regression is similar to KNN but used for regression. In particular, it focuses on a specific region of the input, as opposed to the global space (like linear regression does). For example, we can output the local average:
\[ \hat{f}(x) = \frac{\sum_{i=1}^{n} y_i \cdot I(\lVert x_i - x \rVert \le h)}{\sum_{i=1}^{n} I(\lVert x_i - x \rVert \le h)}, \tag{2.4} \]
which can also be expressed in a form similar to linear regression:
\[ \hat{f}(x) = \sum_{i=1}^{n} w_i y_i, \qquad w_i = \frac{I(\lVert x_i - x \rVert \le h)}{\sum_{j=1}^{n} I(\lVert x_j - x \rVert \le h)}. \]
Note that the $w_i$'s here represent a hard boundary: every $x_i$ within distance $h$ of $x$ receives the same nonzero weight, and every other point receives $w_i = 0$. In the general case, $w$ can be expressed using kernel functions.

In general, we can phrase the problem as finding
\[ \hat{w} = \arg\min_{w}\; (\Phi w - y)^T(\Phi w - y) + \lambda\,\mathrm{pen}(w), \tag{3.8} \]
where $\mathrm{pen}(w)$ is a penalty function.

[Figure: visualization of different kinds of penalty functions]

3.3.2 Connection to MLE and MAP

Consider a linear regression problem
\[ Y = \hat{f}(X) + \epsilon = X\hat{w} + \epsilon, \]
where the noise $\epsilon \sim N(0, \sigma^2 I)$, which implies $Y \sim N(X\hat{w}, \sigma^2 I)$. If $\Phi^T\Phi$ is invertible, then $w$ can be determined exactly by MCLE:
\[ \hat{w}_{MCLE} = \arg\max_{w}\; \underbrace{\log P(\{y_i\}_{i=1}^{n} \mid w, \sigma^2, \{x_i\}_{i=1}^{n})}_{\text{Conditional log likelihood}} = \arg\min_{w} \sum_{i=1}^{n} (x_i w - y_i)^2 = \hat{w}_{LR}, \tag{3.9} \]
where the last equality follows from (3.1). In other words, the Least Square Estimate is the same as MCLE under a Gaussian model.

In case $\Phi^T\Phi$ is not invertible, we can encode the Ridge bias by letting $w \sim N(0, \tau^2 I)$ and $P(w) \propto \exp(-w^T w / 2\tau^2)$, which would yield
\[ \hat{w}_{MCAP} = \arg\max_{w}\; \underbrace{\log P(\{y_i\}_{i=1}^{n} \mid w, \sigma^2, \{x_i\}_{i=1}^{n})}_{\text{Conditional log likelihood}} + \underbrace{\log P(w)}_{\text{log prior}} = \arg\min_{w} \sum_{i=1}^{n} (x_i w - y_i)^2 + \lambda \lVert w \rVert_2^2 = \hat{w}_{Ridge}, \tag{3.10} \]
where $\lambda$ is a constant determined by $\sigma^2$ and $\tau^2$ (here $\lambda = \sigma^2/\tau^2$), and the last equality follows from (3.6). In other words, the prior belief that $w$ is Gaussian with mean 0 biases the solution toward “small” $w$.
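The shrinking effect of the Gaussian prior in (3.10) can be checked numerically. The sketch below assumes the standard closed-form ridge solution $\hat{w} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T y$, which is not derived in this part of the notes, and uses made-up data whose two feature columns are identical so that $\Phi^T\Phi$ is singular.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge estimate: solve (Phi^T Phi + lam * I) w = Phi^T y."""
    n_features = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))

# Hypothetical design matrix: two identical columns make Phi^T Phi singular,
# so the unregularized normal equations have no unique solution.
Phi = np.hstack([x, x])
y = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=50)

rank = np.linalg.matrix_rank(Phi.T @ Phi)   # less than 2: not invertible
w_small = ridge_fit(Phi, y, lam=0.1)        # weak prior (large tau^2)
w_large = ridge_fit(Phi, y, lam=100.0)      # strong prior (small tau^2)
```

With a small `lam` the two weights split the true slope between them, while a large `lam` pulls both toward zero, which is the "small $w$" bias described above.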

When working with products, probabilities of entire datasets often get too small. A possible solution is to use the log of probabilities, often termed log likelihood¹:
\[ \log \hat{P}(\text{dataset} \mid M) = \log \prod_{k=1}^{n} \hat{P}(x_k \mid M) = \sum_{k=1}^{n} \log \hat{P}(x_k \mid M). \tag{1.3} \]

¹ Note that because $\log t$ is monotonically increasing for $t > 0$, maximizing $\log t$ is the same as maximizing $t$.

In this case, the algorithm for MLE is as follows.

Algorithm 1: (Finding the MLE)
Given $n$ data points $x_1, x_2, \ldots, x_n$ and a model $\theta$ represented by an expression for $P(X \mid \theta)$, perform the following steps:

1. Compute the log-likelihood
\[ L = \log \prod_{i=1}^{n} P(x_i \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta). \]
2. For each parameter $\gamma$ in $\theta$, find the solution(s) to the equation $\frac{\partial L}{\partial \gamma} = 0$.
3. The solution $\hat{\gamma}$ that satisfies $\frac{\partial^2 L}{\partial \hat{\gamma}^2} \le 0$ is the MLE of $\gamma$.

1.2 Bayesian learning and MAP

We first note the Bayes formula
\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{\sum_{A} P(B \mid A)\,P(A)}. \tag{1.4} \]
In Bayesian learning, prior information is encoded as a distribution over possible values of parameters $P(M)$. Using (1.4), we get an updated posterior distribution over parameters. To derive the estimate of the true parameter, we choose the value that maximizes the posterior probability.

Definition 2: (MAP)
Given a dataset and a model $M$ with prior $P(M)$, the maximum a posteriori (MAP) estimate of $M$ is
\[ \hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid \text{dataset}) = \arg\max_{\theta} P(\text{dataset} \mid \theta)\,P(\theta). \tag{1.5} \]

Algorithm 5: (Nadaraya-Watson Kernel Regression)
Given $n$ data points $\{(x_i, y_i)\}_{i=1}^{n}$, we can output the value at a new point $x$ as
\[ \hat{f}(x) = \sum_{i=1}^{n} w_i y_i, \qquad w_i = \frac{K\!\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right)}, \tag{2.5} \]
where $K$ is a kernel function. Some typical kernel functions include:

• Boxcar kernel: $K(t) = I(|t| \le 1)$.
• Gaussian kernel: $K(t) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{t^2}{2}\right)$.

The distance $h$ in this case is called the kernel bandwidth. The choice of $h$ should depend on the number of training data (determines variance) and the smoothness of the function (determines bias).

• Large bandwidth averages more data points so reduces noise (lower variance).
• Small bandwidth fits more accurately (lower bias).

[Figure: kernel regression fits for bandwidths h = 1, h = 50 and h = 200]

In general this is the bias-variance tradeoff. Bias represents how accurate the result is (lower bias = more accurate). Variance represents how sensitive the algorithm is to changes in the input (lower variance = less sensitive). Here a large bandwidth ($h = 200$) yields low variance and high bias, while a small bandwidth ($h = 1$) yields high variance and low bias. In this case, $h = 50$ seems like the best middle ground.

Diving in the Math 5 - Ridge regression and MCAP
Since we are given $P(w) \propto \exp(-w^T w / 2\tau^2)$, let $P(w) = \exp(-c \lVert w \rVert_2^2)$, where $c$ is some constant, then $-\log P(w) = c \lVert w \rVert_2^2$, so (3.10) is equivalent to finding
\[ \inf_{w} \{L(w) + c \lVert w \rVert_2^2\} = \inf_{w} \{L(w)\} \ \text{ such that } \lVert w \rVert_2^2 \le L(c), \]
where $L(c)$ is a bijective function of $c$. So adding $c \lVert w \rVert_2^2$ is the same as the ridge regression constraint $\lVert w \rVert_2^2 \le L$ for some constant $L$.

Similarly, we can encode the Lasso bias by letting $w_i \sim \mathrm{Laplace}(0, t)$ (iid) and $P(w_i) \propto \exp(-|w_i| / t)$, which would yield
\[ \hat{w}_{MCAP} = \arg\max_{w}\; \underbrace{\log P(\{y_i\}_{i=1}^{n} \mid w, \sigma^2, \{x_i\}_{i=1}^{n})}_{\text{Conditional log likelihood}} + \underbrace{\log P(w)}_{\text{log prior}} = \arg\min_{w} \sum_{i=1}^{n} (x_i w - y_i)^2 + \lambda \lVert w \rVert_1 = \hat{w}_{Lasso}, \tag{3.11} \]
where $\lambda$ is a constant determined by $\sigma^2$ and $t$, and the last equality follows from (3.7). In other words, the prior belief that $w$ is Laplace with mean 0 biases the solution toward “sparse” $w$.
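Algorithm 5 (Nadaraya-Watson kernel regression) can be sketched in a few lines for one-dimensional inputs. The function names, the tiny training set and the bandwidth values are assumptions for illustration; the boxcar kernel is taken to be the indicator of $|t| \le 1$ and the Gaussian kernel to be the standard normal density.

```python
import numpy as np

def boxcar_kernel(t):
    """K(t) = 1 if |t| <= 1, else 0."""
    return (np.abs(t) <= 1.0).astype(float)

def gaussian_kernel(t):
    """K(t) = exp(-t^2 / 2) / sqrt(2 * pi)."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def kernel_regression(x_query, x_train, y_train, bandwidth, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate at a scalar query point (1-D inputs)."""
    k = kernel((x_query - x_train) / bandwidth)
    total = k.sum()
    if total == 0.0:
        raise ValueError("no training point has non-zero kernel weight")
    weights = k / total              # the w_i of (2.5); they sum to 1
    return float(weights @ y_train)

# Hypothetical 1-D training set.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

local_avg = kernel_regression(2.0, x_train, y_train, bandwidth=1.0, kernel=boxcar_kernel)
smooth = kernel_regression(2.0, x_train, y_train, bandwidth=1.0)
very_wide = kernel_regression(2.0, x_train, y_train, bandwidth=1e6)
```

With the boxcar kernel the estimate is the plain average of the targets whose inputs lie within one bandwidth of the query. With a very large bandwidth every point gets nearly the same weight, so the estimate approaches the global mean of `y_train`, the high-bias, low-variance extreme.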

If we only have very few samples, MLE may not yield accurate results, so it is useful to take into account prior knowledge. When the number of samples gets large, the effect of prior knowledge will diminish.

Similar to MLE, we have the following algorithm for MAP.

Algorithm 2: (Finding the MAP)
Given $n$ data points $x_1, x_2, \ldots, x_n$, a model $\theta$ represented by an expression for $P(X \mid \theta)$, and the prior knowledge $P(\theta)$, perform the following steps:

1. Compute the log of the prior times the likelihood (the log-posterior up to an additive constant)
\[ L = \log\Bigl( P(\theta) \cdot \prod_{i=1}^{n} P(x_i \mid \theta) \Bigr) = \log P(\theta) + \sum_{i=1}^{n} \log P(x_i \mid \theta). \]
2. For each parameter $\gamma$ in $\theta$, find the solution(s) to the equation $\frac{\partial L}{\partial \gamma} = 0$.
3. The solution $\hat{\gamma}$ that satisfies $\frac{\partial^2 L}{\partial \hat{\gamma}^2} \le 0$ is the MAP of $\gamma$.

Chapter 3

Linear Regression

3.1 Basic linear regression

Definition 3: (Linear Regression)
Given an input $x$ we would like to compute an output $y$ as
\[ y = wx + \epsilon, \]
where $w$ is a parameter and $\epsilon$ represents measurement noise.

Our goal is to estimate $w$ from training data of $(x_i, y_i)$ pairs. One way is to find the least square error (LSE)
\[ \hat{w}_{LR} = \arg\min_{w} \sum_{i} (y_i - w x_i)^2, \tag{3.1} \]
which minimizes the squared distance between the measurements and the predicted line. LSE has a nice probabilistic interpretation (as we will see shortly, if $\epsilon \sim N(0, \sigma^2)$ then $\hat{w}$ is the MLE of $w$) and is easy to compute. In particular, the solution to (3.1) is
\[ \hat{w} = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2}. \tag{3.2} \]

Diving in the Math 3 - Solving linear regression using LSE
We take the derivative w.r.t. $w$ and set it to 0:
\[ 0 = \frac{\partial}{\partial w} \sum_{i} (y_i - w x_i)^2 = -2 \sum_{i} x_i (y_i - w x_i), \]
which yields
\[ \sum_{i} x_i y_i = \sum_{i} w x_i^2 \ \Rightarrow\ w = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2}. \]

If the line does not pass through the origin, simply change the model to
\[ y = w_0 + w_1 x + \epsilon, \]
and following the same process gives
\[ w_0 = \frac{\sum_{i} (y_i - w_1 x_i)}{n}, \qquad w_1 = \frac{\sum_{i} x_i (y_i - w_0)}{\sum_{i} x_i^2}. \tag{3.3} \]

Chapter 4

Logistic Regression

4.1 Definition

We know that regression is for predicting real-valued output $Y$, while classification is for predicting (finite) discrete-valued $Y$. But is there a way to connect regression to classification? Can we predict the “probability” of a class label? The answer is generally yes, but we have to keep in mind the constraint that the probability value should lie in $[0, 1]$.

Definition 4: (Logistic Regression)
Assume the following functional form for $P(Y \mid X)$:
\[ P(Y = 1 \mid X) = \frac{1}{1 + \exp\bigl(-(w_0 + \sum_{i} w_i X_i)\bigr)}, \tag{4.1} \]
\[ P(Y = 0 \mid X) = \frac{1}{1 + \exp\bigl(w_0 + \sum_{i} w_i X_i\bigr)}. \tag{4.2} \]

In essence, logistic regression means applying the logistic function $\sigma(z) = \frac{1}{1 + \exp(-z)}$ to a linear function of the data. However, note that it is still a linear classifier.

Diving in the Math 6 - Logistic Regression as linear classifier
Note that $P(Y = 1 \mid X)$ can be rewritten as
\[ P(Y = 1 \mid X) = \frac{\exp(w_0 + \sum_{i} w_i X_i)}{1 + \exp(w_0 + \sum_{i} w_i X_i)}. \]
We would assign label 1 if $P(Y = 1 \mid X) > P(Y = 0 \mid X)$, which is equivalent to
\[ \exp\Bigl(w_0 + \sum_{i} w_i X_i\Bigr) > 1 \iff w_0 + \sum_{i} w_i X_i > 0. \]
Similarly, we would assign label 0 if $P(Y = 1 \mid X) < P(Y = 0 \mid X)$, which is equivalent to
\[ \exp\Bigl(w_0 + \sum_{i} w_i X_i\Bigr) < 1 \iff w_0 + \sum_{i} w_i X_i < 0. \]
In other words, the decision boundary is the hyperplane $w_0 + \sum_{i} w_i X_i = 0$, which is linear.
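The equivalence shown in Diving in the Math 6 can be checked directly: thresholding the probability from (4.1) at 0.5 gives the same labels as checking the sign of the linear score. The weights and inputs below are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w0, w):
    """P(Y = 1 | X) from (4.1) for each row of X."""
    return sigmoid(w0 + X @ w)

def predict_label(X, w0, w):
    """Assign label 1 exactly when the linear score w0 + sum_i w_i X_i is positive."""
    return (w0 + X @ w > 0).astype(int)

# Hypothetical weights and inputs.
w0, w = -1.0, np.array([2.0, -0.5])
X = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5], [3.0, 4.0]])

probabilities = predict_proba(X, w0, w)
labels_from_score = predict_label(X, w0, w)
labels_from_proba = (probabilities > 0.5).astype(int)
```

Since the label depends only on the sign of the linear score, the sigmoid changes how confident the prediction is but not where the decision boundary lies.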

3.2 Multivariate and general linear regression

If we have several inputs, this becomes a multivariate regression problem:
\[ y = w_0 + w_1 x_1 + \ldots + w_k x_k + \epsilon. \]
However, not all functions can be approximated using the input values directly. In some cases we would like to use polynomial or other terms based on the input data. As long as the model is linear in the coefficients, the equation is still a linear regression problem. For instance,
\[ y = w_0 x_1 + w_1 x_1^2 + \ldots + w_k x_k^2 + \epsilon. \]
Typical non-linear basis functions include:

• Polynomial: $\phi_j(x) = x^j$,
• Gaussian: $\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2\sigma_j^2}\right)$,
• Sigmoid: $\phi_j(x) = \frac{1}{1 + \exp(-s_j x)}$.

Using this new notation, we formulate the general linear regression problem:
\[ y = \sum_{j} w_j \phi_j(x), \]
where $\phi_j(x)$ can either be $x_j$ for multivariate regression or one of the non-linear bases we defined. Now assume the general case where we have $n$ data points $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$, and each data point has $k$ features (recall that feature $j$ of $x^{(i)}$ is denoted $x_j^{(i)}$). Again using LSE to find the optimal solution, by defining
\[ \Phi = \begin{pmatrix} \phi_0(x^{(1)}) & \phi_1(x^{(1)}) & \cdots & \phi_k(x^{(1)}) \\ \phi_0(x^{(2)}) & \phi_1(x^{(2)}) & \cdots & \phi_k(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x^{(n)}) & \phi_1(x^{(n)}) & \cdots & \phi_k(x^{(n)}) \end{pmatrix} = \begin{pmatrix} \phi(x^{(1)})^T \\ \phi(x^{(2)})^T \\ \vdots \\ \phi(x^{(n)})^T \end{pmatrix}, \tag{3.4} \]
we then get
\[ w = (\Phi^T \Phi)^{-1} \Phi^T y. \tag{3.5} \]

Diving in the Math 4 - LSE for general linear regression problem
Our goal is to minimize the following loss function:
\[ J(w) = \sum_{i} \Bigl(y^{(i)} - \sum_{j} w_j \phi_j(x^{(i)})\Bigr)^2 = \sum_{i} \bigl(y^{(i)} - w^T \phi(x^{(i)})\bigr)^2, \]
where $w$ and $\phi(x^{(i)})$ are vectors of dimension $k + 1$ and $y^{(i)}$ is a scalar.
Setting the derivative w.r.t. $w$ to 0:
\[ 0 = \frac{\partial}{\partial w} \sum_{i} \bigl(y^{(i)} - w^T \phi(x^{(i)})\bigr)^2 = -2 \sum_{i} \bigl(y^{(i)} - w^T \phi(x^{(i)})\bigr)\,\phi(x^{(i)})^T, \]
which yields
\[ \sum_{i} y^{(i)} \phi(x^{(i)})^T = w^T \sum_{i} \phi(x^{(i)})\,\phi(x^{(i)})^T. \]
Hence, defining $\Phi$ as in (3.4) would give us
\[ (\Phi^T \Phi)\,w = \Phi^T y \ \Rightarrow\ w = (\Phi^T \Phi)^{-1} \Phi^T y. \]

4.2 Training logistic regression

Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ where the input has $d$ features, we want to learn the parameters $w_0, w_1, \ldots, w_d$. We can do so by MCLE:
\[ \hat{w}_{MCLE} = \arg\max_{w} \prod_{i=1}^{n} P(y^{(i)} \mid x^{(i)}, w). \tag{4.3} \]
Note the Discriminative philosophy: don’t waste effort learning $P(X)$, focus on $P(Y \mid X)$ - that’s all that matters for classification! Using (4.1) and (4.2), we can then compute the log-likelihood:
\[ l(w) = \ln\Bigl(\prod_{i=1}^{n} P(y^{(i)} \mid x^{(i)}, w)\Bigr) = \sum_{i=1}^{n} \Bigl[ y^{(i)} \Bigl(w_0 + \sum_{j=1}^{d} w_j x_j^{(i)}\Bigr) - \ln\Bigl(1 + \exp\Bigl(w_0 + \sum_{j=1}^{d} w_j x_j^{(i)}\Bigr)\Bigr) \Bigr]. \tag{4.4} \]
There is no closed-form solution to maximize $l(w)$, but we note that it is a concave function.

Definition 5: (Concave function)
A function $l(w)$ is called concave if the line segment joining two points $l(w_1)$, $l(w_2)$ on the function does not lie above the function on the interval $[w_1, w_2]$.

[Figure: a concave function with the chord between two points lying below the curve]

Equivalently, a function $l(w)$ is concave on $[w_1, w_2]$ if
\[ l(t x_1 + (1 - t) x_2) \ge t\,l(x_1) + (1 - t)\,l(x_2) \]
for all $x_1, x_2 \in [w_1, w_2]$ and $t \in [0, 1]$. If the inequality is reversed, $l$ is a convex function.

Chapter 2

Nonparametric models: KNN and kernel regression

2.1 Bayes decision rule

Classification is the task of predicting a (discrete) output label given the input data. The performance of any classification algorithm depends on two factors: (1) the parameters are correct, and (2) the underlying assumption holds. The optimal algorithm is called the Bayes decision rule.

Algorithm 3: (Bayes decision rule)
If we know the conditional probability $P(x \mid y)$ and class prior $P(y)$, use (1.4) to compute
\[ P(y = i \mid x) = \frac{P(x \mid y = i)\,P(y = i)}{P(x)} \propto P(x \mid y = i)\,P(y = i) = q_i(x) \tag{2.1} \]
and use $q_i(x)$ to select the appropriate class. Choose class 0 if $q_0(x) > q_1(x)$ and 1 otherwise. In general choose the class $\hat{c} = \arg\max_{c} \{q_c(x)\}$.

Because our decision is probabilistic, there is still a chance of error. The Bayes error rate (risk) of the data distribution is the probability that an instance is misclassified by the Bayes decision rule. For binary classification, the risk for sample $x$ is
\[ R(x) = \min\{P(y = 0 \mid x),\, P(y = 1 \mid x)\}. \tag{2.2} \]
In other words, if $P(y = 0 \mid x) > P(y = 1 \mid x)$, then we would pick the label 0, and the risk is the probability that the actual label is 1, which is $P(y = 1 \mid x)$.

We can also compute the expected risk - the risk for the entire range of values of $x$:
\[ E[R(x)] = \int_{x} R(x)\,P(x)\,dx = \int_{x} \min\{P(y = 0 \mid x),\, P(y = 1 \mid x)\}\,P(x)\,dx = P(y = 0) \int_{L_1} P(x \mid y = 0)\,dx + P(y = 1) \int_{L_0} P(x \mid y = 1)\,dx, \]
where $L_i$ is the region over which the decision rule outputs label $i$.

The risk value we computed assumes that both errors (assigning instances of class 1 to 0 and vice versa) are equally harmful. In general, we can set the weight penalty $L_{i,j}(x)$ for assigning

instances of class i to class j. This gives us the concept of a loss function

$$E[L] = L_{0,1}(x)\,P(y = 0)\int_{L_1} P(x \mid y = 0)\,dx + L_{1,0}(x)\,P(y = 1)\int_{L_0} P(x \mid y = 1)\,dx. \tag{2.3}$$

2.2 Classification

There are roughly three types of classifiers:

1. Instance based classifiers: use observation directly (no models). Example: K nearest neighbor.
2. Generative: build a generative statistical model. Example: Naive Bayes.
3. Discriminative: directly estimate a decision rule/boundary. Example: decision tree.

Diving in the Math 7 - Log likelihood of logistic regression is concave

For convenience we denote $x_0^{(i)} = 1$, so that $w_0 + \sum_{j=1}^{d} w_j x_j^{(i)} = w^T x^{(i)}$.
We first note the following lemmas:

1. If f is convex then −f is concave and vice versa.
2. A linear combination of n convex (concave) functions $f_1, f_2, \ldots, f_n$ with nonnegative coefficients is convex (concave).
3. Another property of twice differentiable convex function is that the second derivative is nonnegative. Using this property, we can see that $f(x) = \log(1 + \exp x)$ is convex.
4. If f and g are both convex, twice differentiable and g is non-decreasing, then $g \circ f$ is convex.

To sum up, we have the following algorithm for the general linear regression problem.

Algorithm 6: (General linear regression algorithm)
Input: Given n input data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ where $x^{(i)}$ is 1 × m and $y^{(i)}$ is scalar, as well as m basis functions $\{\phi_j\}_{j=1}^{m}$, we find

$$\hat{w} = \arg\min_w \sum_{i=1}^{n} \big(y^{(i)} - w^T \phi(x^{(i)})\big)^2$$

by the following procedure:

1. Compute Φ as in (3.4).
2. Output $\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$.
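The closed-form solution of Algorithm 6 can be illustrated with a small sketch. It assumes that row i of Φ holds the basis function values φ_j(x^(i)) and solves the normal equations (Φ^T Φ)w = Φ^T y by Gaussian elimination instead of forming an explicit inverse. The helper names (`design_matrix`, `solve`, `fit_least_squares`) and the toy data are illustrative choices, not part of the notes.

```python
def design_matrix(xs, basis):
    """Row i holds phi_j(x_i) for every basis function phi_j (assumed layout of Phi)."""
    return [[phi(x) for phi in basis] for x in xs]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("Phi^T Phi is singular; a regularizer is needed")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - tail) / M[r][r]
    return w

def fit_least_squares(xs, ys, basis):
    """Build Phi^T Phi and Phi^T y, then solve the normal equations."""
    Phi = design_matrix(xs, basis)
    k = len(basis)
    A = [[sum(row[a] * row[b] for row in Phi) for b in range(k)] for a in range(k)]
    rhs = [sum(row[a] * y for row, y in zip(Phi, ys)) for a in range(k)]
    return solve(A, rhs)

# Toy data generated from y = 1 + 2x, with basis functions 1 and x
basis = [lambda x: 1.0, lambda x: x]
w = fit_least_squares([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], basis)
```

The `ValueError` branch corresponds to the case where Φ^T Φ is not invertible, for example when there are more basis functions than data points.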
Now we rewrite l(w) as follows:

$$\begin{aligned}
l(w) &= \sum_{i=1}^{n} \Big[ y^{(i)} w^T x^{(i)} - \log\big(1 + \exp(w^T x^{(i)})\big) \Big] \\
&= \sum_{i=1}^{n} y^{(i)} w^T x^{(i)} - \sum_{i=1}^{n} \log\big(1 + \exp(w^T x^{(i)})\big) \\
&= \sum_{i=1}^{n} y^{(i)} f_i(w) - \sum_{i=1}^{n} g(f_i(w)),
\end{aligned}$$

where $f_i(w) = w^T x^{(i)}$ and $g(z) = \log(1 + \exp z)$.
$f_i(w)$ is of the form $a^T w + b$ with $a = x^{(i)}$ and $b = 0$, which means it's affine (i.e., both concave and convex). We also know that g(z) is convex, and it's easy to see g is non-decreasing. This means $g(f_i(w))$ is convex, or equivalently, $-g(f_i(w))$ is concave.
To sum up, we can express l(w) as

$$l(w) = \underbrace{\sum_{i=1}^{n} y^{(i)} f_i(w)}_{\text{concave}} + \underbrace{\sum_{i=1}^{n} -g(f_i(w))}_{\text{concave}},$$

hence l(w) is concave.

The classification task itself contains several steps:

1. Feature transformation: e.g., how do we encode a picture?
2. Model / classifier specification: What type of classifier to use?
3. Model / classifier estimation (with regularization): How do we learn the parameters of our classifier? Do we have enough examples to learn a good model?
4. Feature selection: Do we really need all the features? Can we use a smaller number and still achieve the same (or better) results?

Classification is one of the key components of supervised learning, where we provide the algorithm with labels to some of the instances and the goal is to generalize so that a model / method can be used to determine the labels of the unobserved examples.

2.3 K-nearest neighbors

A simple yet surprisingly efficient algorithm is K nearest neighbors.

Algorithm 4: (K-nearest neighbors)
Given n data points, a distance function d and a new point x to classify, select the class of x based on the majority vote in the K closest points.

Note that this requires the definition of a distance function or similarity measure between samples. We also need to determine K beforehand. Larger K means the resulting classifier is more ‘smooth’ (but smoothness is primarily dependent on the actual distribution of the data).

From a probabilistic view, KNN tries to approximate the Bayes decision rule on a subset of data. We compute P(x | y), P(y) and P(x) for some small region around our sample, and the size of that region will be dependent on the distribution of the test sample.

3.3 Regularized least squares

3.3.1 Definition

In the previous chapter we see that a linear regression problem involves solving $(\Phi^T \Phi) w = \Phi^T y$ for w. If $\Phi^T \Phi$ is invertible, we would get $w = (\Phi^T \Phi)^{-1} \Phi^T y$ as in (3.5). Now what if $\Phi^T \Phi$ is not invertible?

Recall that full rank matrices are invertible, and that

$$\operatorname{rank}(\Phi^T \Phi) = \text{the number of non-zero eigenvalues of } \Phi^T \Phi \le \min(n, k) \quad \text{since } \Phi \text{ is } n \times k.$$

In other words, $\Phi^T \Phi$ is not invertible if n < k, i.e., there are more features than data points. More specifically, we have n equations and k > n unknowns - this is an underdetermined system of linear equations with many feasible solutions. In that case, the solution needs to be further constrained.

One way, for example, is Ridge Regression - using L2 norm as penalty to bias the solution to “small” values of w (so that small changes in input don’t translate to large changes in output):

$$\begin{aligned}
\hat{w}_{\text{Ridge}} &= \arg\min_w \sum_{i=1}^{n} (y_i - x_i w)^2 + \lambda \lVert w \rVert_2^2 \\
&= \arg\min_w (\Phi w - y)^T (\Phi w - y) + \lambda \lVert w \rVert_2^2, \quad \lambda \ge 0 \\
&= (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y.
\end{aligned} \tag{3.6}$$

We could also use Lasso Regression (L1 penalty)

$$\hat{w}_{\text{Lasso}} = \arg\min_w \sum_{i=1}^{n} (y_i - x_i w)^2 + \lambda \lVert w \rVert_1, \tag{3.7}$$

which biases towards many parameter values being zero - in other words, many inputs become irrelevant to prediction in high-dimensional settings. There is no closed form solution for (3.7), but it can be optimized by sub-gradient descent.

Since l(w) is concave, it can be optimized by the gradient ascent algorithm.

Algorithm 7: (Gradient ascent algorithm)
Initialize: Pick w at random.
Gradient:

$$\nabla_w E(w) = \left( \frac{\partial E(w)}{\partial w_0}, \frac{\partial E(w)}{\partial w_1}, \ldots, \frac{\partial E(w)}{\partial w_d} \right).$$

Update:

$$\Delta w = \eta \nabla_w E(w), \qquad w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \frac{\partial E(w)}{\partial w_i},$$

where η > 0 is the learning rate.
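As a rough illustration of the gradient ascent update applied to the logistic regression log-likelihood l(w) = Σ_i [y^(i) w^T x^(i) − log(1 + exp(w^T x^(i)))], the sketch below uses the gradient Σ_i x^(i) (y^(i) − P̂(y^(i) = 1 | x^(i), w)). Two assumptions differ from the algorithm statement: the weights start at zero rather than at a random point, and a fixed number of steps replaces a convergence threshold. The toy data and the step size are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, X, y):
    """l(w) = sum_i [ y_i * w.x_i - log(1 + exp(w.x_i)) ]."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        total += yi * z - math.log(1.0 + math.exp(z))
    return total

def gradient_ascent(X, y, eta=0.1, steps=500):
    """Repeat w <- w + eta * gradient for a fixed number of steps."""
    w = [0.0] * len(X[0])  # assumption: zero start instead of a random start
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj + eta * gj for wj, gj in zip(w, grad)]
    return w

# Each sample starts with x_0 = 1 so that w[0] plays the role of w_0
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 1, 0, 1]
w = gradient_ascent(X, y)
ll_start = log_likelihood([0.0, 0.0], X, y)
ll_end = log_likelihood(w, X, y)
```

Because l(w) is concave, a sufficiently small step size moves the log-likelihood upward toward the single global maximum, which is why `ll_end` exceeds `ll_start` here.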

In this case our likelihood function is specified in (4.4), so we have the following steps for training logistic regression:

Algorithm 8: (Gradient ascent algorithm for logistic regression)
Initialize: Pick w at random and a learning rate η.
Update:

• Set an ε > 0 and denote

$$\hat{P}(y^{(i)} = 1 \mid x^{(i)}, w^{(t)}) = \frac{\exp\big(w_0^{(t)} + \sum_{j=1}^{d} w_j^{(t)} x_j^{(i)}\big)}{1 + \exp\big(w_0^{(t)} + \sum_{j=1}^{d} w_j^{(t)} x_j^{(i)}\big)}.$$

• Iterate until $|w_0^{(t+1)} - w_0^{(t)}| < \epsilon$:

$$w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_{i=1}^{n} \Big[ y^{(i)} - \hat{P}(y^{(i)} = 1 \mid x^{(i)}, w^{(t)}) \Big].$$

• For k = 1, . . . , d, iterate until $|w_k^{(t+1)} - w_k^{(t)}| < \epsilon$:

$$w_k^{(t+1)} \leftarrow w_k^{(t)} + \eta \sum_{i=1}^{n} x_k^{(i)} \Big[ y^{(i)} - \hat{P}(y^{(i)} = 1 \mid x^{(i)}, w^{(t)}) \Big].$$

In other words, the two methods have representation equivalence - in particular, a linear decision boundary. However, keep in mind that:

• This is only true in the special case where we assume the feature variances are independent of class label (more specifically, Σ1 = Σ0 in (5.3)).
• Logistic regression is a discriminative model and makes no assumption about P(X | Y) in learning. Instead, it assumes a sigmoid form for P(Y | X).
• The two optimize different functions: MLE / MCLE vs MAP / MCAP and yield different solutions.

More generally, we outline the difference between a generative classifier (e.g., Naive Bayes) and a discriminative classifier (e.g., logistic regression).

Generative | Discriminative
Assume some probability model for P(Y) and P(X | Y). | Assume some functional form for P(Y | X) or the decision boundary.
Estimate parameters of probability models from training data. | Estimate parameters of functional form directly from training data.

Table 5.1: Generative (model-based) vs Discriminative (model-free) classifier.

Algorithm 13: (Backpropagation)
Denote the following:

• l is the index of the training example
• $y_k$ is target output (label) of output unit k
• $o_h$, $o_k$ are unit outputs (obtained by forward propagation) of units h, k. If i is input variable, $o_i = x_i$.
• $w_{ij}$ is the weight from node i to node j in the next layer.

Initialize all weights to small random numbers. Until satisfied, do

• For each training example, do

1. Input the training example to the network and compute the network outputs, using forward propagation.
2. For each output unit k, let
$$\delta_k^{(l)} \leftarrow o_k^{(l)} (1 - o_k^{(l)}) (y_k^{(l)} - o_k^{(l)}).$$
3. For each hidden unit h, let
$$\delta_h^{(l)} \leftarrow o_h^{(l)} (1 - o_h^{(l)}) \sum_{k \in K} w_{hk}\, \delta_k^{(l)},$$
where K is the set of output nodes.
4. Update each network weight $w_{ij}$:
$$w_{ij} \leftarrow w_{ij} + \Delta w_{ij}^{(l)},$$
where $\Delta w_{ij}^{(l)} = \eta\, \delta_j^{(l)} o_i^{(l)}$. With $\delta_k$ defined through $(y_k - o_k)$ as in step 2, this is a step along the negative gradient of the squared error.
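The backpropagation steps can be sketched for a network with one hidden layer of sigmoid units. This is a minimal sketch under several assumptions: the deltas use the (y − o) convention, so each weight moves by +η·δ_j·o_i; a constant input of 1.0 is appended to each layer to act as a bias, which the algorithm statement does not mention; and the layer sizes, seed, learning rate and OR data set are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hid, W_out):
    """Forward propagation; x ends with a constant 1.0 acting as bias input."""
    o_hid = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in W_hid]
    o_hid.append(1.0)  # constant unit so the output layer also has a bias
    o_out = [sigmoid(sum(w * oh for w, oh in zip(ws, o_hid))) for ws in W_out]
    return o_hid, o_out

def train_step(x, y, W_hid, W_out, eta):
    """One incremental update for a single training example (x, y)."""
    o_hid, o_out = forward(x, W_hid, W_out)
    # Output deltas: delta_k = o_k (1 - o_k) (y_k - o_k)
    d_out = [o * (1 - o) * (t - o) for o, t in zip(o_out, y)]
    # Hidden deltas, computed from the weights before they are updated
    d_hid = [o_hid[h] * (1 - o_hid[h]) *
             sum(W_out[k][h] * d_out[k] for k in range(len(W_out)))
             for h in range(len(W_hid))]
    # Weight update: w_ij <- w_ij + eta * delta_j * o_i
    for k, ws in enumerate(W_out):
        for h in range(len(ws)):
            ws[h] += eta * d_out[k] * o_hid[h]
    for h, ws in enumerate(W_hid):
        for i in range(len(ws)):
            ws[i] += eta * d_hid[h] * x[i]

def total_error(data, W_hid, W_out):
    """Sum of squared errors, E = 1/2 * sum (y - o)^2."""
    return 0.5 * sum((t - o) ** 2
                     for x, y in data
                     for t, o in zip(y, forward(x, W_hid, W_out)[1]))

rng = random.Random(0)
W_hid = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(3)]]
# Logical OR; the third input is the constant bias input
data = [([0.0, 0.0, 1.0], [0.0]), ([0.0, 1.0, 1.0], [1.0]),
        ([1.0, 0.0, 1.0], [1.0]), ([1.0, 1.0, 1.0], [1.0])]
error_before = total_error(data, W_hid, W_out)
for _ in range(2000):
    for x, y in data:
        train_step(x, y, W_hid, W_out, eta=0.5)
error_after = total_error(data, W_hid, W_out)
```

The hidden deltas are computed before any weight changes, so both layers are updated from the same forward pass.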

Unlike in logistic regression, the function E(w) in neural network is not convex in w. Thus,
gradient descent (and backpropagation) will find a local, not necessarily global minimum, but
it often works well in practice.

Finally, we note the two ways in which backpropagation can be implemented in practice: Batch mode and Incremental mode (also called Stochastic Gradient Descent). Incremental mode is faster to compute and can approximate Batch mode arbitrarily closely if η is small enough.

Batch mode:
Let $E_D(w) = \frac{1}{2} \sum_{l \in D} (y^{(l)} - o^{(l)})^2$.
Do until satisfied:
1. Compute the gradient $\nabla E_D(w)$.
2. $w \leftarrow w - \eta \nabla E_D(w)$.

Incremental mode:
Let $E_l(w) = \frac{1}{2} (y^{(l)} - o^{(l)})^2$.
Do until satisfied:
• For each training example l in D:
1. Compute the gradient $\nabla E_l(w)$.
2. $w \leftarrow w - \eta \nabla E_l(w)$.

Table 6.1: Batch mode vs Incremental mode
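The two update schedules can be compared on a deliberately tiny problem. The sketch assumes a single linear unit o = w·x with one scalar weight, so the gradients of E_D and E_l can be written out by hand; the data set and learning rate are hypothetical.

```python
def batch_epoch(w, data, eta):
    """One batch step: gradient of E_D summed over all examples, then one update."""
    grad = sum(-(y - w * x) * x for x, y in data)
    return w - eta * grad

def incremental_epoch(w, data, eta):
    """One pass in incremental mode: update after every single example."""
    for x, y in data:
        grad = -(y - w * x) * x  # gradient of E_l for this example only
        w = w - eta * grad
    return w

# Toy data generated from y = 2x, so both schedules should approach w = 2
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_batch = w_inc = 0.0
for _ in range(200):
    w_batch = batch_epoch(w_batch, data, eta=0.01)
    w_inc = incremental_epoch(w_inc, data, eta=0.01)
```

With a small η the two weight trajectories stay close, which is the sense in which incremental mode approximates batch mode.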


Chapter 5
Naive Bayes Classifier

5.1 Gaussian Bayes

Recall that the Bayes decision rule (2.1) is the optimal classifier, but requires having knowledge of $P(Y \mid X) \propto P(X \mid Y)P(Y)$, which is difficult to compute. To tackle this problem, we consider appropriate models for the two terms: class probability P(Y) and class conditional distribution of features P(X | Y).

Consider an example of X being continuous 1-dimensional and Y being binary. The class probability P(Y) then follows a Bernoulli distribution with parameter θ, i.e., P(Y = 1) = θ. We further assume the Gaussian class conditional density

$$P(X = x \mid Y = i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right), \tag{5.1}$$

where we note that the distribution would be different for each class (hence the notation $\mu_i$, $\sigma_i$). In total there are 5 parameters: θ, µ0, µ1, σ0, σ1.

In case X is 2-dimensional, we would have

$$P(X = x \mid Y = i) = \frac{1}{2\pi\sqrt{|\Sigma_i|}} \exp\left( -\frac{(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}{2} \right), \tag{5.2}$$

where $\Sigma_i$ is a 2 × 2 covariance matrix. In total there are 11 parameters: θ, µ0, µ1 (2-dimensional), Σ0, Σ1 (2 × 2 symmetric matrix).

Further note that the decision boundary in this case is determined by the ratio

$$\begin{aligned}
\frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} &= \frac{P(X = x \mid Y = 1)\,P(Y = 1)}{P(X = x \mid Y = 0)\,P(Y = 0)} \\
&= \sqrt{\frac{|\Sigma_0|}{|\Sigma_1|}} \exp\left( -\frac{(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}{2} + \frac{(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}{2} \right) \cdot \frac{\theta}{1 - \theta}.
\end{aligned} \tag{5.3}$$

This implies a quadratic equation in x, but if Σ1 = Σ0 the quadratic terms cancel out and the equation is linear.

Chapter 6
Neural Networks and Deep Learning

6.1 Definition

Definition 7: (Neural network)
Given a function f : X → Y such that:

• f can be non-linear
• X is vector of continuous / discrete variables
• Y is vector of continuous / discrete variables

A neural network is a way to represent f by a network of logistic / sigmoid units.

In its simplest form, a neural network has one input node and one output node

[Figure: x → o, edge labeled w]

so it is equivalent to logistic regression

$$o(x) = \sigma(wx) = \frac{1}{1 + e^{-wx}}. \tag{6.1}$$

6.3 Convolutional neural networks

Definition 8: (Deep architectures)
Deep architectures are composed of multiple levels of non-linear operations, such as neural nets with many hidden layers.

Deep learning methods aim at learning feature hierarchies, where features from the higher levels of the hierarchy are formed by lower level features. One such method is the convolutional neural network, which, compared to a standard feedforward neural network:

• has much fewer connections and parameters
• is easier to train
• has only slightly worse theoretically best performance.

First, we define the term convolution.

Definition 9: (Convolution)
The convolution of two functions f and g, denoted f ∗ g, is defined as:

• If f and g are continuous:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\,g(t - \tau)\,d\tau = \int_{-\infty}^{\infty} f(t - \tau)\,g(\tau)\,d\tau. \tag{6.5}$$

• If f and g are discrete:
$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\,g[n - m] = \sum_{m=-\infty}^{\infty} f[n - m]\,g[m]. \tag{6.6}$$

• If discrete g has support on {−M, . . . , M}:
$$(f * g)[n] = \sum_{m=-M}^{M} f[n - m]\,g[m]. \tag{6.7}$$
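The discrete convolution in (6.6) can be sketched for finite sequences by treating every entry outside the stored range as zero, which is an assumption about how the infinite sums are truncated. The function name and the sample sequences are illustrative.

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences (zero outside their range)."""
    out = [0.0] * (len(f) + len(g) - 1)
    for n in range(len(out)):
        for m in range(len(f)):
            # Only terms with a valid index into g contribute to the sum
            if 0 <= n - m < len(g):
                out[n] += f[m] * g[n - m]
    return out

result = convolve([1, 2, 3], [0, 1, 0.5])
swapped = convolve([0, 1, 0.5], [1, 2, 3])
```

Swapping the arguments gives the same sequence, which mirrors the two equivalent sums in (6.6).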

Informally, the convolution gives a sense of how much two functions “overlap.” A simple convolutional neural network (CNN) is a sequence of layers, each of which transforms one volume of activations to another through a differentiable function. We use three main types of layers to build the network architectures: Convolutional Layer, Pooling Layer, and Fully Connected Layer

On the other hand, a neural network with one hidden layer

[Figure: x → o_h → o, edges labeled w0 and wh]

The number of parameters we would need to learn for Gaussian Bayes in the general case, with k labels and d-dimensional inputs, is:

• P(Y = i) = $p_i$ for 1 ≤ i ≤ k − 1: k − 1 parameters, and


• $P(X = x \mid Y = i) \sim N(\mu_i, \Sigma_i)$: d parameters for $\mu_i$ and $\frac{d(d+1)}{2}$ parameters for $\Sigma_i$,

which result in $kd + \frac{kd(d+1)}{2} = O(kd^2)$ parameters.

If X is discrete, again with k labels and d-dimensional inputs, the number of parameters is:

• P(Y = i) = $p_i$ for 1 ≤ i ≤ k − 1: k − 1 parameters, and
• P(X = x | Y = i) comes from a probability table with $2^d - 1$ entries,

which results in $k(2^d - 1)$ parameters to learn.

Having too many parameters means we need a lot of training data to learn them. We therefore introduce an assumption that can significantly reduce the number of parameters.

5.2 Naive Bayes Classifier

Definition 6: (Naive Bayes Classifier)
The Naive Bayes Classifier is the Bayes decision rule with an additional “naive” assumption that features are independent given the class label:

$$P(X \mid Y) = P(X_1, X_2, \ldots, X_d \mid Y) = \prod_{i=1}^{d} P(X_i \mid Y). \tag{5.4}$$

In this case, the output class of an input x is

$$\hat{f}_{NB}(x) = \arg\max_y P(x_1, \ldots, x_d \mid y)\,P(y) = \arg\max_y \prod_{i=1}^{d} P(x_i \mid y)\,P(y).$$

Therefore, if the conditional independence assumption holds, Naive Bayes is the optimal classifier. Using this assumption, we can then formulate the Naive Bayes classifier algorithm, where the class priors and class conditional probabilities are estimated using MLE.

Algorithm 9: (Naive Bayes Classifier for discrete features)
Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ where $x^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_d^{(i)})$, we compute the following:

• Class prior: $\hat{P}(Y = y) = \frac{1}{n} \sum_{i=1}^{n} I(y^{(i)} = y)$.

• Class joint distribution:
$$\hat{P}(X_j = x_j, Y = y) = \frac{1}{n} \sum_{i=1}^{n} I(x_j^{(i)} = x_j,\; y^{(i)} = y).$$

• Prediction for test data given input x:
$$\hat{f}_{NB}(x) = \arg\max_y \hat{P}(y) \prod_{j=1}^{d} \frac{\hat{P}(x_j, y)}{\hat{P}(y)} = \arg\max_y \hat{P}(y) \prod_{j=1}^{d} \frac{\sum_{i=1}^{n} I(x_j^{(i)} = x_j,\; y^{(i)} = y)}{\sum_{i=1}^{n} I(y^{(i)} = y)}. \tag{5.5}$$

(exactly as seen in regular Neural Networks). Typically, CNNs are used for image classification in computer vision.

We now discuss the components of a CNN. The following text excerpts are from Chapter 10 of Artificial Intelligence for Humans, Vol 3: Neural Networks and Deep Learning and Stanford’s CS 231n.

6.3.1 Convolutional Layer

The primary purpose of a convolutional layer is to detect features such as edges, lines, blobs of color, and other visual elements. Each feature is represented by a filter; the more filters that we give to a convolutional layer, the more features it can detect.

More formally, a filter is a square 2D matrix that scans over the image. The convolutional layer, which is essentially a set of filters, acts as a smaller grid that sweeps left to right over each row of the image. Each cell in a filter is a weight.

The sweeping phase (forward propagation) is done as follows. First, the input image may be padded with some layers of zero cells as needed. The stride specifies the number of positions at which the convolutional filters will stop. The convolutional filters move to the right, advancing by the number of cells specified in the stride. Once the far right is reached, the convolutional filter moves back to the far left, then it moves down by the stride amount and continues to the right again. Hence, the number of steps in one sweeping phase is

$$\text{Number of steps} = \frac{W - F + 2P}{S} + 1,$$

where W is the image size, F is the filter size, P is the padding and S is the stride.

We can use the same set of weights as the convolutional filter sweeps over the image. This process allows convolutional layers to share weights and greatly reduce the amount of processing needed. In this way, we can recognize the image in shift positions because the same convolutional filter sweeps across the entire image.

The input and output of a convolutional layer are both 3D boxes. For the input to a convolutional layer, the width and height of the box is equal to the width and height of the input image. The depth of the box is equal to the color depth of the image. For an RGB image, the depth is 3, equal to the components of red, green, and blue. If the input to the convolutional layer is another layer, then it will also be a 3D box; however, the dimensions of that 3D box will be dictated by the hyper-parameters of that layer. Like any other layer in the neural network, the size of the 3D box output by a convolutional layer is dictated by the hyper-parameters of the layer. The width and height of this box are both equal to the number of steps computed above, (W − F + 2P)/S + 1. However, the depth is equal to the number of filters.

A visualization for the sweeping phase is in the Convolution Demo section at Stanford’s CS 231n. In summary, the procedure taking place at the convolutional layer is as follows.

would output

$$o(x) = \sigma\Big(w_0 + \sum_h w_h \underbrace{\sigma\Big(w_{0h} + \sum_i w_{ih} x_i\Big)}_{o_h}\Big). \tag{6.2}$$

More generally, prediction in neural networks is done by starting from the input layer and, for each subsequent layer, computing the output of the sigmoid units (forward propagation).

6.2 Training a neural network

6.2.1 Gradient descent

Let’s treat a neural network as a function $Y = f_w(X) + \epsilon$ where $f_w$ is determined given w and $\epsilon \sim N(0, \sigma^2 I)$, so $Y \sim N(f_w(X), \sigma^2 I)$. One way to learn the weights is by MCLE:

$$\hat{w}_{MCLE} = \arg\max_w \ln \prod_l P(y^{(l)} \mid x^{(l)}, w) = \arg\min_w \sum_l \big(y^{(l)} - \hat{f}_w(x^{(l)})\big)^2.$$

In other words, the weights are trained to minimize sum of squared errors of predicted network outputs. We may also want to restrict the weights to small values, and this bias can be encoded in MCAP:

$$\hat{w}_{MCAP} = \arg\max_w \ln \Big( P(w) \prod_l P(y^{(l)} \mid x^{(l)}, w) \Big) = \arg\min_w \; c \sum_i w_i^2 + \sum_l \big(y^{(l)} - \hat{f}_w(x^{(l)})\big)^2,$$

where $P(w) \propto \exp(-w^T w / 2\tau^2)$ (recall the connection between Ridge Regression and MCAP (3.10)). In other words, the weights are trained to minimize sum of squared errors of predicted network outputs plus weight magnitudes.

To perform the above optimization, we introduce a routine called gradient descent.

Algorithm 12: (Gradient descent algorithm)
Initialize: Pick w at random.
Gradient:

$$\nabla_w E(w) = \left( \frac{\partial E(w)}{\partial w_0}, \frac{\partial E(w)}{\partial w_1}, \ldots, \frac{\partial E(w)}{\partial w_d} \right).$$

Update:

$$\Delta w = -\eta \nabla_w E(w), \qquad w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta \frac{\partial E(w)}{\partial w_i},$$

where η > 0 is the learning rate.

Suppose we feed the training data $\{(x^{(l)}, y^{(l)})\}_{l=1}^{n}$ where $x^{(l)}$ is d-dimensional into a one-layer neural network with d input nodes $x_i$ and one output node o. The gradient w.r.t every


output weight $w_i$ is

$$\frac{\partial E}{\partial w_i} = \sum_{l=1}^{n} (o^{(l)} - y^{(l)})\,o^{(l)} (1 - o^{(l)})\,x_i^{(l)}. \tag{6.3}$$

Diving in the Math 9 - Gradient descent for weights at output layer

[Figure: inputs x1, . . . , xd with weights w1, . . . , wd feeding one sigmoid unit, net = Σ_i w_i x_i, o = σ(net) = 1/(1 + e^(−net))]

Note that $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function and $\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$.
We have $E(w) = \frac{1}{2} \sum_{l=1}^{n} (y^{(l)} - o^{(l)})^2$ so

$$\frac{\partial E}{\partial w_i} = \frac{1}{2} \sum_{l=1}^{n} \frac{\partial}{\partial w_i} (y^{(l)} - o^{(l)})^2 = \frac{1}{2} \sum_{l=1}^{n} \frac{\partial (y^{(l)} - o^{(l)})^2}{\partial o^{(l)}} \cdot \frac{\partial o^{(l)}}{\partial w_i} = \sum_{l=1}^{n} (o^{(l)} - y^{(l)}) \cdot \frac{\partial o^{(l)}}{\partial w_i}.$$

Now note that

$$\frac{\partial o^{(l)}}{\partial w_i} = \frac{\partial o^{(l)}}{\partial \text{net}^{(l)}} \cdot \frac{\partial \text{net}^{(l)}}{\partial w_i} = o^{(l)} (1 - o^{(l)})\,x_i^{(l)},$$

so

$$\frac{\partial E}{\partial w_i} = \sum_{l=1}^{n} (o^{(l)} - y^{(l)})\,o^{(l)} (1 - o^{(l)})\,x_i^{(l)}.$$

For the hidden layers, consider $\frac{\partial E}{\partial w_{hk}}$ where $w_{hk}$ connects a node $o_h$ from layer L to $o_k$ from layer L + 1. Further assume that Γ is the set of all nodes at layer L + 2. We then have the expression

$$\frac{\partial E}{\partial w_{hk}} = o_h\, o_k (1 - o_k) \sum_{\gamma \in \Gamma} \delta_\gamma w_{k\gamma}, \tag{6.4}$$

where $\delta_\gamma = o_\gamma (1 - o_\gamma) \frac{\partial E}{\partial o_\gamma}$. This means that we can first compute the gradients w.r.t the weights at the output layer, then propagate back to the weights at one previous layer, then two previous layers, and so on, until the input layer.

Algorithm 14: (Convolutional layer’s forward pass)
Accepts a 3D image of size W1 × H1 × D1.
Requires four hyperparameters: number of filters K, spatial extent F, stride S, and amount of zero padding P.
Produces a 3D image of size W2 × H2 × D2 where

$$W_2 = \frac{W_1 - F + 2P}{S} + 1; \quad H_2 = \frac{H_1 - F + 2P}{S} + 1; \quad D_2 = K. \tag{6.8}$$

With parameter sharing, it introduces F × F × D1 weights per filter, for a total of F × F × D1 × K weights and K biases. In the output volume, the d-th depth slice (of size W2 × H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by d-th bias.

6.3.2 Pooling layer

Pooling layer downsamples an input 3D image to a new one with smaller widths and heights. Typically a pooling layer follows immediately after a convolutional layer. A typical choice for pooling is max-pooling, which, for every F × F region in the input image, outputs the maximum number in that region to the output image.

Algorithm 15: (Pooling layer’s forward pass)
Accepts a volume of size W1 × H1 × D1.
Requires two hyperparameters: spatial extent F and stride S.
Produces a volume of size W2 × H2 × D2 where:

$$W_2 = \frac{W_1 - F}{S} + 1; \quad H_2 = \frac{H_1 - F}{S} + 1; \quad D_2 = D_1. \tag{6.9}$$

Introduces zero parameters since it computes a fixed function of the input. For Pooling layers, it is not common to pad the input using zero-padding.

It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: A pooling layer with F = 3, S = 2 (also called overlapping pooling), and more commonly F = 2, S = 2.

6.3.3 Fully Connected Layer

This is the same layer from a feedforward neural network, with two hyperparameters: neuron count and activation function. Every neuron in the previous layer’s 3D image output is connected to each neuron in this layer through a weight value. A dot product of the input (flattened to 1D) and the weight vector is then passed to the activation function. Dense layers can employ many different kinds of activation functions, such as ReLU, sigmoid or tanh.

Algorithm 10: (Naive Bayes Classifier for continuous features)
Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ where $x^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_d^{(i)})$, we compute the following:

• Class prior: $\hat{P}(Y = y) = \frac{1}{n} \sum_{i=1}^{n} I(y^{(i)} = y)$.

• Class conditional distribution:
$$\hat{\mu}_{jy} = \frac{1}{\sum_{i=1}^{n} I(y^{(i)} = y)} \sum_{i=1}^{n} x_j^{(i)}\, I(y^{(i)} = y),$$
$$\hat{\sigma}_{jy}^2 = \frac{1}{\sum_{i=1}^{n} I(y^{(i)} = y) - 1} \sum_{i=1}^{n} (x_j^{(i)} - \hat{\mu}_{jy})^2\, I(y^{(i)} = y),$$
$$\hat{P}(X_j = x_j \mid Y = y) = \frac{1}{\hat{\sigma}_{jy} \sqrt{2\pi}} \exp\left( -\frac{(x_j - \hat{\mu}_{jy})^2}{2\hat{\sigma}_{jy}^2} \right).$$

• Prediction for test data given input x:
$$\hat{f}_{NB}(x) = \arg\max_y \hat{P}(y) \prod_{j=1}^{d} \hat{P}(x_j \mid y). \tag{5.6}$$

As we noted earlier, the number of parameters in the above algorithms is much fewer.

Diving in the Math 8 - Number of parameters in Naive Bayes

Consider input variable X with discrete features X1, . . . , Xd each taking one of K values and output label Y taking one of M values. Further suppose that the label distribution is categorical and the feature distribution conditioned on the label is also categorical.

Without naive Bayes assumption:

• M − 1 parameters $p_0, p_1, \ldots, p_{M-2}$ for label:
$$P(Y = y) = \begin{cases} p_y & \text{if } y < M - 1 \\ 1 - \sum_{i=0}^{M-2} p_i & \text{if } y = M - 1 \end{cases}$$

• For each label y, the corresponding probability table has $K^d - 1$ parameters.

So the total number of parameters is $M - 1 + M(K^d - 1) = M \cdot K^d - 1$.

With naive Bayes assumption:

• M − 1 parameters $p_0, p_1, \ldots, p_{M-2}$ for labels, as mentioned above.
• For each Y = y and 1 ≤ i ≤ d, $X_i \mid Y$ comes from a categorical distribution so it has K − 1 parameters. So we need M d(K − 1) parameters for
$$\{P(X_i = x_i \mid Y = y) \mid i = 1..d,\; x_i = 1..K,\; y = 0..M - 1\}.$$

The total number of parameters is $M - 1 + M d(K - 1)$. Therefore the number of parameters is reduced from exponential to linear with the Naive Bayes assumption.
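The two parameter counts can be compared numerically. In this short sketch d denotes the number of features, K the number of values per feature and M the number of labels; the function names and the sample sizes are illustrative.

```python
def params_without_nb(M, K, d):
    """(M - 1) label parameters plus one full table of K^d - 1 entries per label."""
    return (M - 1) + M * (K ** d - 1)

def params_with_nb(M, K, d):
    """(M - 1) label parameters plus K - 1 parameters per feature and label."""
    return (M - 1) + M * d * (K - 1)

# Hypothetical setting: binary label, 10 binary features
full = params_without_nb(M=2, K=2, d=10)
naive = params_with_nb(M=2, K=2, d=10)
```

Even for 10 binary features the full table needs about a hundred times more parameters than the factorized model, and the gap grows exponentially with d.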


Chapter 7
Support Vector Machine

7.1 Introduction

Recall that a regression classifier with linear decision boundary would typically look like

[Figure]

where the boundary is determined by taking into account all data points (e.g., linear regression (3.1)). In this case, the decision line is closer to the blue nodes since many of them are far off to the right. However, note that there could be many possible classifiers that have different boundaries but yield the same outcome:

[Figure]

Diving in the Math 10 - Gradient descent for weights at hidden layers

We would like to evaluate

$$\frac{\partial E}{\partial w_{hk}} = \frac{\partial E}{\partial o_k} \cdot \frac{\partial o_k}{\partial w_{hk}}.$$

Since $\frac{\partial o_k}{\partial w_{hk}} = o_k (1 - o_k)\,o_h$, we can rewrite the above expression as

$$\frac{\partial E}{\partial w_{hk}} = o_h\, o_k (1 - o_k) \frac{\partial E}{\partial o_k} = o_h \delta_k.$$

Now we consider

$$\frac{\partial E}{\partial o_k} = \sum_\gamma \frac{\partial E}{\partial o_\gamma} \cdot \frac{\partial o_\gamma}{\partial o_k} = \sum_\gamma \frac{\partial E}{\partial o_\gamma}\, o_\gamma (1 - o_\gamma)\, w_{k\gamma} = \sum_\gamma \delta_\gamma w_{k\gamma}.$$

Hence

$$\frac{\partial E}{\partial w_{hk}} = o_h\, o_k (1 - o_k) \sum_\gamma \delta_\gamma w_{k\gamma}.$$

6.2.2 Backpropagation

More generally, we can derive the backpropagation algorithm (Algorithm 13). We also note the special properties of this algorithm:

• It computes gradient descent over the entire network weight vector. Training can take thousands of iterations, which is slow, but using network after training is very fast.
• It can easily generalize to arbitrary directed graphs.
• It will find a local, not necessarily global error minimum (because error function E is no longer convex in weights), but often works well in practice.
• It often includes a weight momentum α:
$$\Delta w_{ij}^{(n)} = \eta\, \delta_j x_{ij} + \alpha\, \Delta w_{ij}^{(n-1)}.$$

The expressive capabilities of neural network are powerful:

• Every boolean function can be represented by a network with one hidden layer, but may require exponential (in number of inputs) hidden units.
• Every bounded continuous function can be approximated with arbitrarily small error, by a network with one hidden layer.
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers.

However, neural network may still encounter the issue of overfitting, which can be mitigated by MCAP, early stopping, or by regulating the number of hidden units, which essentially prevents overly complex models.

We note two important issues in Naive Bayes. First, the conditional independence assumption does not always hold; nevertheless, this is still the most used classifier, especially when data is limited. Second, in practice we typically use MAP estimates for the probabilities instead of MLE since insufficient data may cause MLE to be zero. For instance, if we never see a training instance where X1 = a and Y = b then, based on MLE, $\hat{P}(X_1 = a \mid Y = b) = 0$. Consequently, no matter what the values X2, . . . , Xd take, we see that

$$\hat{P}(X_1 = a, X_2, \ldots, X_d \mid Y) = \hat{P}(X_1 = a \mid Y) \prod_{j=2}^{d} \hat{P}(X_j \mid Y) = 0.$$

To resolve this issue, we can use a technique called smoothing and add m “virtual” data points.

Algorithm 11: (Naive Bayes Classifier for discrete features with smoothing)
Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ where $x^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_d^{(i)})$, assume some prior distributions (typically uniform) Q(Y = b) and Q(Xj = a, Y = b). We then compute the following:

• Class prior: $\hat{P}(Y = y) = \frac{1}{n} \sum_{i=1}^{n} I(y^{(i)} = y)$.

• Class conditional distribution:
$$\hat{P}(X_j = x_j \mid Y = y) = \frac{\sum_{i=1}^{n} I(x_j^{(i)} = x_j,\; y^{(i)} = y) + m\,Q(X_j = x_j, Y = y)}{\sum_{i=1}^{n} I(y^{(i)} = y) + m\,Q(Y = y)}.$$

• Prediction for test data given input x:
$$\hat{f}_{NB}(x) = \arg\max_y \hat{P}(y) \prod_{j=1}^{d} \hat{P}(x_j \mid y). \tag{5.7}$$

5.3 Text classification: bag of words model

We present a case study of Naive Bayes Classifier: text classification. Given the input text as a string of words, we have to predict a category for it (e.g., what is the topic, is this a spam email). Using the bag of words approach, we construct the input features X as the count of how many times each word appears in the document. The dimension d of X is then the number of distinct words in the document, which can be very large, leading to a huge probability table for P(X | Y) (with $2^d - 1$ entries). However, under the Naive Bayes assumption, P(X | Y) is simply a product of probability of each word raised to its count. It follows that

$$\hat{f}_{NB}(x) = \arg\max_y P(y) \prod_{w = w_1}^{w_k} P(w \mid y)^{\text{count}(w)}, \tag{5.8}$$

where $w_1, \ldots, w_k$ are all the distinct words in the document. Note that here we assume the order of words in the document doesn’t matter, which sounds silly but often works very well.
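The bag of words classifier in (5.8) can be sketched in a few lines. The sketch makes three assumptions that the text leaves open: it uses add-one smoothing, one simple instance of adding virtual data points with a uniform prior; it works with log probabilities to avoid numerical underflow; and it ignores words never seen in training. The function names and the four toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count labels and per-label word occurrences from (words, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return label_counts, word_counts, vocab

def predict_nb(words, model):
    """Return arg max_y of log P(y) + sum_w count(w) * log P(w | y)."""
    label_counts, word_counts, vocab = model
    n = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label, c in label_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(c / n)
        for w, cnt in Counter(words).items():
            if w not in vocab:
                continue  # assumption: unseen words are skipped
            # Add-one smoothing keeps every word probability above zero
            p = (word_counts[label][w] + 1) / (total + len(vocab))
            score += cnt * math.log(p)
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("buy cheap pills now".split(), "spam"),
        ("cheap pills cheap offer".split(), "spam"),
        ("meeting schedule for monday".split(), "ham"),
        ("lunch meeting now".split(), "ham")]
model = train_nb(docs)
first = predict_nb("cheap pills".split(), model)
second = predict_nb("monday meeting".split(), model)
```

Without the smoothing term, a single word that never occurred with a label would drive that label's probability to zero regardless of the other words.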

5.4 Generative vs Discriminative Classifier

We first note the following comparison between Naive Bayes and Logistic Regression:

One way to decide on a single classifier is to find a max margin classifier: a boundary that leads to the largest margin from both sets of points. This also means that instead of fitting all points,

we may consider only the “boundary” points, and we want to learn a boundary that leads to the
largest margin from points on both sides. These boundary points are called the support vectors.

In particular, we would specify a max margin classifier based on parameters w and b, and then
perform classification as follows

In general, if the data is mapped into sufficiently high dimension, then samples will be linearly
separable. n data points can be separable in a space of n − 1 dimensions or more. However, this
transformation poses two difficulties:

• High computation burden due to high-dimensionality.

• Many more parameters.

SVM solves these two issues by:

• Using dual formulation, which only assigns parameters to samples, not features (i.e., each
x(i) has an associated αi ).
• Using kernel tricks for efficient computation.

7.2 Primal form

7.2.1 Linearly separable case

Our goal, as previously mentioned, is to find the maximum margin. Let M denote the width of the margin; we can then see that

$$M = \frac{2}{\sqrt{w^T w}}. \tag{7.1}$$
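A quick numerical check of (7.1) may help. The sketch assumes the two margin planes are w·x + b = +1 and w·x + b = −1 and moves from a point on the −1 plane along w by λ = 2 / (w·w) to reach the +1 plane. The weight vector, bias and starting point are hypothetical values chosen so that the arithmetic is easy to follow.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_width(w):
    """Width of the margin, M = 2 / sqrt(w.w)."""
    return 2.0 / math.sqrt(dot(w, w))

w, b = [3.0, 4.0], 1.0
x_minus = [-2.0 / 3.0, 0.0]   # chosen so that w.x + b = -1
lam = 2.0 / dot(w, w)         # step length along w between the two planes
x_plus = [xm + lam * wi for xm, wi in zip(x_minus, w)]
distance = math.sqrt(sum((p - m) ** 2 for p, m in zip(x_plus, x_minus)))
```

Here w has length 5, so the margin is 2/5 = 0.4, and the point reached by the step satisfies w·x + b = 1.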


Diving in the Math 11 - Computing margin M in terms of weight w and bias b

First observe that the vector w is orthogonal to the +1 plane. To prove this, let u and v be any two points on the +1 plane then

$$w^T u = w^T v = 1 - b \;\Rightarrow\; w^T (u - v) = 0.$$

Similarly, w is orthogonal to the −1 plane too. Hence, if $x^+$ is a point on the + plane and $x^-$ is the point closest to $x^+$ on the − plane, then the vector from $x^+$ to $x^-$ is parallel to w. In other words, $x^+ = \lambda w + x^-$ for some λ. Now we have

$$\begin{aligned}
w^T x^+ + b &= 1 \\
w^T (\lambda w + x^-) + b &= 1 \\
(w^T x^- + b) + \lambda w^T w &= 1 \\
-1 + \lambda w^T w &= 1 \\
\lambda &= \frac{2}{w^T w}.
\end{aligned}$$

Hence

$$M = |x^+ - x^-| = |\lambda w| = \lambda \sqrt{w^T w} = \frac{2}{w^T w} \sqrt{w^T w} = \frac{2}{\sqrt{w^T w}}.$$

We can now search for the optimal parameters by finding a solution that:

1. Correctly classifies all points
2. Maximizes the margin (or equivalently minimizes $w^T w$).

Several optimization methods can be used: Gradient descent, simulated annealing, EM, etc. In this case, the problem also belongs to the category of quadratic programming (QP).

Definition 10: (Quadratic programming)
Quadratic programming solves the optimization problem

$$\min_u \frac{u^T R u}{2} + d^T u + c,$$

where u is a vector, R is square matrix, d is vector and c is scalar.
Furthermore, u is subject to n inequality constraints

$$a_{i1} u_1 + a_{i2} u_2 + \ldots \le b_i, \quad 1 \le i \le n,$$

and k equality constraints

$$a_{j1} u_1 + a_{j2} u_2 + \ldots = b_{n+j}, \quad 1 \le j \le k.$$

More specifically, we can frame the margin maximization problem as a QP problem:

$$\min_w \frac{w^T w}{2}$$

subject to n constraints if there are n samples x.

• $w^T x + b \ge 1$ for all x in class +1,
• $w^T x + b \le -1$ for all x in class −1,

8.2 Mathematical details

We now examine each of the above formulas to see how they were derived. First observe that:

• $\epsilon_t$ is the fraction of misclassified (weighted) samples in iteration t. Recall our earlier definition of a weak learner as one that has error < 50%. Hence we always have $\epsilon_t < 0.5$ and therefore $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} > 0$.

• If sample i is classified correctly then $y^{(i)} h_t(x^{(i)}) = 1$ so
$$D_t(i) \exp(-\alpha_t y^{(i)} h_t(x^{(i)})) = \frac{D_t(i)}{\exp(\alpha_t)} < D_t(i),$$
i.e., the weight of this sample decreases. On the other hand, if sample i is classified incorrectly, then its weight increases.

In other words, in each iteration the classifier will focus on a different set of points (more specifically, those that were misclassified in the previous iteration), and in the end, we hope that the combination of these classifiers (over all iterations) can classify all points correctly. This is the idea behind boosting. Now, to prove that it does work (i.e., the error converges to 0 after some number of iterations), we answer the following questions.

8.2.1 What $\alpha_t$ to choose for hypothesis $h_t$?

Note that the training error of the final classifier H is bounded by

$$\frac{1}{m} \sum_{i=1}^{m} I\big(H(x^{(i)}) \ne y^{(i)}\big) \le \frac{1}{m} \sum_{i=1}^{m} \exp\big(-y^{(i)} f(x^{(i)})\big) = \prod_{t=1}^{T} Z_t \tag{8.1}$$

where $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ and $H(x) = \operatorname{sign}(f(x))$.

Note that if we have n data points with m dimensions then the number of parameters is m in primal form (i.e., the features of weight w) and n in dual form (i.e., an $\alpha_i$ for each $x^{(i)}$). At first glance, because n ≫ m, the primal formulation is at an advantage. However, note that in dual form we only care about the support vectors; in other words, the parameters are only those $\alpha_i$ that are positive, and their number is usually a lot less than n. Hence, the dual form is not worse than primal form in the original space. In the transformed space, as x increases in dimension, so does w, so the primal form requires more parameters, while the dual form generally also sees an increase in the number of support vectors, but not as much.

7.3.3 Kernel tricks

While working in higher dimensions is beneficial, it also increases our run time because of the dot product computation. However, there is a neat trick we can use.
Consider, for example, all quadratic terms for the features x1, x2, . . . , xm:

$$\phi(x) = (\underbrace{1, \sqrt{2} x_1, \ldots, \sqrt{2} x_m}_{m+1 \text{ linear terms}},\; \underbrace{x_1^2, \ldots, x_m^2}_{m \text{ quadratic terms}},\; \underbrace{\sqrt{2} x_1 x_2, \ldots, \sqrt{2} x_{m-1} x_m}_{m(m-1)/2 \text{ pairwise terms}})^T. \tag{7.9}$$

The dot product operation would normally be

$$\phi(x)^T \phi(z) = \sum_i 2 x_i z_i + \sum_i (x_i)^2 (z_i)^2 + \sum_{i<j} 2 x_i x_j z_i z_j + 1,$$

which has $O(m^2)$ operations. However, we can obtain dramatic savings by noting that

$$\begin{aligned}
(x \cdot z + 1)^2 &= (x \cdot z)^2 + 2 (x \cdot z) + 1 \\
&= \Big(\sum_i x_i z_i\Big)^2 + \sum_i 2 x_i z_i + 1 \\
&= \sum_i 2 x_i z_i + \sum_i (x_i)^2 (z_i)^2 + \sum_{i<j} 2 x_i x_j z_i z_j + 1 \\
&= \phi(x)^T \phi(z).
\end{aligned}$$

In other words, to compute $\phi(x)^T \phi(z)$, we can simply compute x · z + 1 (which only needs m operations) and then square it. Hence, we don’t need to work directly with the transformations φ(x) to compute their dot products. The function $K(x, z) = (x \cdot z + 1)^2$ that corresponds to the map φ in (7.9) is called a polynomial kernel (of degree 2).
The kernel trick works for higher order polynomials as well. In general, a polynomial of degree d can be computed by $(x \cdot z + 1)^d$. Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function, such as the Radial basis kernel function $K(x, z) = \exp\left( -\frac{\lVert x - z \rVert^2}{2\sigma^2} \right)$.

35 40 45
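The identity behind the degree-2 polynomial kernel can be checked numerically. The sketch below builds the explicit feature map from (7.9) and compares its dot product with $(x \cdot z + 1)^2$; the function names `phi` and `poly_kernel` and the random test vectors are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit degree-2 feature map from (7.9): constant, linear, squared and pairwise terms."""
    x = np.asarray(x, dtype=float)
    root2 = np.sqrt(2.0)
    pairwise = [root2 * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], root2 * x, x ** 2, pairwise))

def poly_kernel(x, z, degree=2):
    """Kernel shortcut: only one m-dimensional dot product is needed."""
    return (np.dot(x, z) + 1.0) ** degree

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility only
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = phi(x) @ phi(z)          # O(m^2) terms
implicit = poly_kernel(x, z)        # O(m) work
print(explicit, implicit)           # the two values agree up to rounding
```

For $m = 5$ the explicit map already has $1 + 5 + 5 + 10 = 21$ coordinates, while the kernel only ever touches the 5 original ones.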

7.2.2 Non linearly separable case
So far we have assumed that the data is linearly separable, i.e., there is a hyperplane $w^T x + b = 0$ that perfectly separates the +1 and −1 class. But this is usually not the case in practice, as there can be noise and outliers. One way to address this is to penalize the number of misclassified points $m$, i.e.,
$$\min_w \; w^T w + C \cdot m,$$
where $C$ is a regularization constant. However, this is hard to encode in a QP problem. Instead of minimizing the number of misclassified points we can minimize the distance between these points and their correct plane. In this case, the new optimization problem is
$$\min_w \; \frac{w^T w}{2} + C \sum_{i=1}^{n} \epsilon_i,$$
subject to $2n$ constraints if there are $n$ samples $x^{(i)}$:

• $w^T x^{(i)} + b \ge 1 - \epsilon_i$ for all $x^{(i)}$ in class +1

• $w^T x^{(i)} + b \le -1 + \epsilon_i$ for all $x^{(i)}$ in class −1.

• $\epsilon_i \ge 0$ for all $i$.

In summary, we have two optimization problems for two cases:

Separable case. Find
$$\min_w \; \frac{w^T w}{2}$$
subject to
• $w^T x + b \ge 1$ for all $x$ in class +1
• $w^T x + b \le -1$ for all $x$ in class −1

Non-separable case. Find
$$\min_w \; \frac{w^T w}{2} + C \sum_{i=1}^{n} \epsilon_i$$
subject to
• $w^T x^{(i)} + b \ge 1 - \epsilon_i$ for all $x^{(i)}$ in class +1
• $w^T x^{(i)} + b \le -1 + \epsilon_i$ for all $x^{(i)}$ in class −1
• $\epsilon_i \ge 0$ for all $i$.

Table 7.1: Optimization constraints for separable and non-separable case in primal SVM

7.3 Dual representation

7.3.1 Linearly separable case
Instead of solving the QPs in Table 7.1 directly, we will solve a dual formulation of the SVM optimization problem. The main reason for switching to this type of representation is that it would allow us to use a neat trick that will make our lives easier (and the run time faster).
Starting from the separable case, note that we can rephrase the constraints as
$$(w^T x^{(i)} + b) \, y^{(i)} \ge 1 \quad (7.2)$$
for all $i$, where $y^{(i)}$, the class of $x^{(i)}$, is ±1. We can then encode this as part of our minimization problem using Lagrange multipliers.

7.3.4 Non linearly separable case
Using the Lagrange multiplier method on the optimization function for the primal form's non linearly separable case (Table 7.1), we obtain our dual target function
$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} \quad (7.10)$$
subject to $\sum_i \alpha_i y^{(i)} = 0$ and $0 \le \alpha_i \le C$ (so now the $\alpha_i$'s are bounded above by the regularization constant). To evaluate a new sample $x$, we similarly perform the computation as in (7.8).
In summary, we have two optimization problems for two cases:

Separable case. Find
$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}$$
where $\sum_i \alpha_i y^{(i)} = 0$ and $\alpha_i \ge 0$ for all $i$.

Non-separable case. Find
$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}$$
where $\sum_i \alpha_i y^{(i)} = 0$ and $0 \le \alpha_i \le C$ for all $i$.

Table 7.2: Optimization constraints for separable and non-separable case in dual SVM.

7.4 Other topics

7.4.1 Why do SVMs work?
If we are using huge feature spaces (with kernels), why are we not overfitting the data?

• Number of parameters remains the same (and most are set to 0).

• While we have a lot of input values, at the end we only care about the support vectors and these are usually a small group of samples.

• The minimization (or the maximizing of the margin) function acts as a sort of regularization term leading to reduced overfitting.

7.4.2 Multi-class classification with SVM
If we have data from more than two classes, the most common solution is the one-versus-all approach:

• Create a classifier for each class against all other data.

• For a new point use all classifiers and compare the margin for all selected classes. The class with the largest margin is selected.

Note that this is not necessarily valid since this is not what we trained the SVM for, but it often works well in practice.

Diving in the Math 12 - Upper bound of final classifier's training error
To prove the first inequality in (8.1), note that:

• Each correctly classified sample $i$ contributes 0 to the LHS and $\frac{1}{m} \exp(-y^{(i)} f(x^{(i)})) > 0$ to the RHS.

• Each incorrectly classified sample $i$ contributes $\frac{1}{m}$ to the LHS and $\frac{1}{m} \exp(-y^{(i)} f(x^{(i)})) \ge \frac{1}{m}$ to the RHS, because $y^{(i)} f(x^{(i)}) \le 0$.

In other words, $I\left(H(x^{(i)}) \ne y^{(i)}\right) \le \exp(-y^{(i)} f(x^{(i)}))$ for all $i$, so $\sum_i I\left(H(x^{(i)}) \ne y^{(i)}\right) \le \sum_i \exp(-y^{(i)} f(x^{(i)}))$.
To prove the second inequality, note that the definition of $D_t$ (with $D_1(i) = \frac{1}{m}$) gives us
$$D_{T+1}(i) = D_T(i) \cdot \frac{\exp(-\alpha_T y^{(i)} h_T(x^{(i)}))}{Z_T}$$
$$= D_{T-1}(i) \cdot \frac{\exp(-\alpha_{T-1} y^{(i)} h_{T-1}(x^{(i)}))}{Z_{T-1}} \cdot \frac{\exp(-\alpha_T y^{(i)} h_T(x^{(i)}))}{Z_T}$$
$$= \dots$$
$$= \frac{\exp\left(-\sum_t \alpha_t y^{(i)} h_t(x^{(i)})\right)}{m \prod_t Z_t} = \frac{\exp(-y^{(i)} f(x^{(i)}))}{m \prod_t Z_t}.$$
On the other hand, because we define the $Z_t$'s as normalization factors,
$$1 = \sum_{i=1}^{m} D_{T+1}(i) = \frac{1}{m \prod_t Z_t} \sum_{i=1}^{m} \exp(-y^{(i)} f(x^{(i)})),$$
which leads to
$$\frac{1}{m} \sum_{i=1}^{m} \exp(-y^{(i)} f(x^{(i)})) = \prod_{t=1}^{T} Z_t.$$

In other words, to guarantee low error, we just need to make sure its upper bound $\prod_t Z_t$ is small. We can tighten this bound greedily by choosing $\alpha_t$ on each iteration to minimize $Z_t$.
To do so, let's define the error at iteration $t$ as $\epsilon_t = \sum_i D_t(i) \, I\left(h_t(x^{(i)}) \ne y^{(i)}\right)$.
It then follows that
$$Z_t = \sum_i D_t(i) \exp(-\alpha_t y^{(i)} h_t(x^{(i)})) = (1 - \epsilon_t) \exp(-\alpha_t) + \epsilon_t \exp(\alpha_t).$$
We can then choose the $\alpha_t$ that minimizes $Z_t$ by solving $\frac{\partial Z_t}{\partial \alpha_t} = 0$, which yields $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.
Further note that $\alpha_t$ decreases as $\epsilon_t$ increases, so intuitively $\alpha_t$ is the "strength" of classifier $h_t$. Hence, we have shown how to minimize $\prod_t Z_t$, but how small can this minimum value be?

8.2.2 Show that training error converges to 0
We can further derive another upper bound for the training error:
$$\mathrm{err}(H) = \frac{1}{m} \sum_{i=1}^{m} I\left(H(x^{(i)}) \ne y^{(i)}\right) \le \prod_{t=1}^{T} Z_t \le \exp\left(-2 \sum_t \gamma_t^2\right) \quad (8.2)$$
where $\gamma_t = \frac{1}{2} - \epsilon_t$.
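The step from $\frac{\partial Z_t}{\partial \alpha_t} = 0$ to the closed form for $\alpha_t$ in Section 8.2.1 can be written out as follows. This is a short worked derivation that uses only the expression $Z_t = (1 - \epsilon_t) \exp(-\alpha_t) + \epsilon_t \exp(\alpha_t)$ and assumes $0 < \epsilon_t < 1$.

```latex
\begin{align*}
\frac{\partial Z_t}{\partial \alpha_t}
  &= -(1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = 0 \\
\epsilon_t\, e^{\alpha_t} &= (1 - \epsilon_t)\, e^{-\alpha_t} \\
e^{2\alpha_t} &= \frac{1 - \epsilon_t}{\epsilon_t} \\
\alpha_t &= \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} \\[4pt]
\frac{\partial^2 Z_t}{\partial \alpha_t^2}
  &= (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} > 0
  \quad \text{(so the stationary point is a minimum)}
\end{align*}
```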

Algorithm 16: (Lagrange multiplier method)
Consider a problem of finding
$$\min_w f(w)$$
such that $h_i(w) = 0$, $i = 1, \dots, l$.
In the Lagrange multiplier method, we can define the Lagrangian to be
$$L(w, \beta) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w),$$
where the $\beta_i$'s are called the Lagrange multipliers. We would then find and set $L$'s partial derivatives to zero:
$$\frac{\partial L}{\partial w_i} = 0; \quad \frac{\partial L}{\partial \beta_i} = 0,$$
and solve for $w$ and $\beta$.

More generally, consider the following primal optimization problem:
$$\min_w f(w)$$
such that $g_i(w) \le 0$, $i = 1, \dots, k$, and $h_i(w) = 0$, $i = 1, \dots, l$.
To solve it, we start by defining the generalized Lagrangian:
$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$$
where the $\alpha_i$'s and $\beta_i$'s are the Lagrange multipliers.

Using the Lagrange multipliers, consider the quantity
$$\theta_p(w) = \max_{\alpha, \beta : \alpha_i \ge 0} L(w, \alpha, \beta);$$
then our minimization problem becomes
$$\min_w \theta_p(w) = \min_w \max_{\alpha, \beta : \alpha_i \ge 0} L(w, \alpha, \beta).$$
Specifically, our original problem is
$$\min_w \; \frac{w^T w}{2}$$
such that $g_i(w) = -y^{(i)} (w^T x^{(i)} + b) + 1 \le 0$,
which can be translated to
$$\min_w \max_\alpha \; \underbrace{\frac{w^T w}{2} - \sum_i \alpha_i \left(y^{(i)} (w^T x^{(i)} + b) - 1\right)}_{L(w, \alpha)} \quad (7.3)$$

Chapter 8

Ensemble Methods and Boosting

8.1 Introduction
Consider the simple (weak) learners (e.g., naive Bayes, logistic regression, decision tree): those that don't learn too well but are still better than chance, i.e. error < 50% but not close to 0. They are good (low variance, usually don't overfit) but also bad (high bias, can't solve hard problems). Can we somehow improve them by combining them together? A simple approach is "bucket of models":

• Input:
– Your top $T$ favorite learners $L_1, \dots, L_T$
– A dataset $D$

• Learning algorithm:
1. Use 10-fold cross validation to estimate the error of $L_1, \dots, L_T$
2. Pick the best (lowest 10-CV error) learner $L^*$
3. Train $L^*$ on $D$ and return its hypothesis $h^*$

This approach is simple and will give results not much worse than the best of the "base learners", but what if there's not a single best learner? How do we come up with a method that combines multiple classifiers? One way is to perform voting (ensemble methods):

• Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space.

• Output class: (Weighted) vote of each classifier
– Classifiers that are most "sure" will vote with more conviction
– Each classifier will be most "sure" about a particular part of the space
– On average, do better than single classifier!

The question, then, is how we can force classifiers to learn about different parts of the input space and weigh the votes of different classifiers. This leads us to the idea of boosting:

• Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.

• On each iteration $t$:

Diving in the Math 13 - Convergence of AdaBoost's training error
Note that
$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} \;\Rightarrow\; \exp(\alpha_t) = \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}},$$
and also
$$\epsilon_t = \sum_{y^{(i)} \ne h_t(x^{(i)})} D_t(i); \quad 1 - \epsilon_t = \sum_{y^{(i)} = h_t(x^{(i)})} D_t(i).$$
We now have
$$Z_t = \sum_i D_t(i) \exp(-\alpha_t y^{(i)} h_t(x^{(i)}))$$
$$= \sum_{y^{(i)} \ne h_t(x^{(i)})} D_t(i) \exp(-\alpha_t y^{(i)} h_t(x^{(i)})) + \sum_{y^{(i)} = h_t(x^{(i)})} D_t(i) \exp(-\alpha_t y^{(i)} h_t(x^{(i)}))$$
$$= \sum_{y^{(i)} \ne h_t(x^{(i)})} D_t(i) \left(\sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}\right)^{-y^{(i)} h_t(x^{(i)})} + \sum_{y^{(i)} = h_t(x^{(i)})} D_t(i) \left(\sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}\right)^{-y^{(i)} h_t(x^{(i)})}$$
$$= \sum_{y^{(i)} \ne h_t(x^{(i)})} D_t(i) \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} + \sum_{y^{(i)} = h_t(x^{(i)})} D_t(i) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}}$$
$$= \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} \cdot \sum_{y^{(i)} \ne h_t(x^{(i)})} D_t(i) + \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} \cdot \sum_{y^{(i)} = h_t(x^{(i)})} D_t(i)$$
$$= \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} \cdot \epsilon_t + \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} \cdot (1 - \epsilon_t)$$
$$= 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.$$
Furthermore, for any $x \in \mathbb{R}$, $1 - x \le \exp(-x)$. Substituting $x = 4\gamma_t^2$, we see that
$$1 - 4\gamma_t^2 \le \exp(-4\gamma_t^2) \;\Rightarrow\; \sqrt{1 - 4\gamma_t^2} \le \exp(-2\gamma_t^2).$$
Now, since $\epsilon_t = \frac{1}{2} - \gamma_t$, we have
$$Z_t = 2 \sqrt{\epsilon_t (1 - \epsilon_t)} = \sqrt{1 - 4\gamma_t^2} \le \exp(-2\gamma_t^2),$$
hence
$$\mathrm{err}(H) \le \prod_{t=1}^{T} Z_t \le \prod_{t=1}^{T} \exp(-2\gamma_t^2) = \exp\left(-2 \sum_t \gamma_t^2\right).$$

It then follows that, as the number of iterations $T$ increases, $\exp\left(-2 \sum_{t=1}^{T} \gamma_t^2\right)$ decreases exponentially (as long as the $\gamma_t$ stay bounded away from 0), so the training error also approaches 0 exponentially fast.
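To get a feel for the two bounds in (8.2), the sketch below evaluates $\prod_t Z_t$ (with $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$) and $\exp(-2 \sum_t \gamma_t^2)$ for a made-up sequence of weighted errors. The choice of 50 rounds with $\epsilon_t = 0.4$ is purely hypothetical and only illustrates how the bounds shrink with $T$.

```python
import math

def z_t(eps):
    """Normalizer Z_t = 2 * sqrt(eps * (1 - eps)) obtained at the optimal alpha_t."""
    return 2.0 * math.sqrt(eps * (1.0 - eps))

def training_error_bounds(eps_list):
    """Return (prod_t Z_t, exp(-2 * sum_t gamma_t^2)) with gamma_t = 1/2 - eps_t."""
    prod_z = math.prod(z_t(e) for e in eps_list)
    exp_bound = math.exp(-2.0 * sum((0.5 - e) ** 2 for e in eps_list))
    return prod_z, exp_bound

# Hypothetical run: every weak learner has weighted error 0.4 (gamma_t = 0.1).
eps_list = [0.4] * 50
prod_z, exp_bound = training_error_bounds(eps_list)
print(prod_z, exp_bound)  # prod_z is the tighter of the two bounds
```

Even with weak learners that are only slightly better than chance, both bounds decay geometrically in the number of rounds.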

In practice, this also happens quite quickly. In fact, Schapire (1989) showed that in digit recognition, the testing error can still decrease even after the training error reaches 0 (see footnote 1). Boosting is also robust to overfitting.

Some weak learners also have their own ensemble methods apart from AdaBoost. For example, an ensemble of decision trees is called a random forest. For each tree we select a subset of attributes (recommended subset size = square root of number of total attributes) and build the tree using only the selected attributes. An input sample is then classified using majority voting.

Footnote 1: Note that the training error reported in this graph is the global training error where each input is weighed equally as usual. During the iterations of AdaBoost, however, we are concerned with the weighted errors $\epsilon_t$. In this case, while the global training error is 0, the $\epsilon_t$'s may still be > 0, so there is room for improvement, and therefore the test error can still decrease.

In (7.3), $\alpha_i \ge 0$ for all $i$. Setting the derivatives of $L$ with respect to $w$, $\alpha_i$ and $b$ to zero, we get:
$$0 = \frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} \alpha_i y^{(i)} x^{(i)} \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y^{(i)} x^{(i)}. \quad (7.4)$$
$$0 = \frac{\partial L}{\partial \alpha_i} = -y^{(i)} (w^T x^{(i)} + b) + 1 \;\Rightarrow\; b = y^{(i)} - w^T x^{(i)} \text{ for } i \text{ where } \alpha_i > 0. \quad (7.5)$$
$$0 = \frac{\partial L}{\partial b} = \sum_{i=1}^{n} \alpha_i y^{(i)}. \quad (7.6)$$
We mentioned earlier that the only data points of importance are the support vectors, which affect the margin. Originally, we need to find the points that touch the boundary of the margin (i.e., $w^T x + b = \pm 1$). In this case, however, (7.4) gives us the support vectors directly, which are the points $(x^{(i)}, y^{(i)})$ where $\alpha_i > 0$.

Substituting the above results back into (7.3), we get the equivalent problem of
$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} \quad (7.7)$$
where $\sum_i \alpha_i y^{(i)} = 0$ and $\alpha_i \ge 0$ for all $i$. After solving for $\alpha$ in this new problem, to evaluate a new sample $x$, we simply compute
$$\hat{y} = \mathrm{sign}(w^T x + b) = \mathrm{sign}\left(\sum_i \alpha_i y^{(i)} (x^{(i)})^T x + b\right). \quad (7.8)$$

Note that both the optimization function (7.7) and decision function (7.8) rely on the sum of dot products $(x^{(j)})^T x^{(i)}$, which can be expensive to compute.

7.3.2 Transformation of inputs
When the data is not linearly separable, the original input space ($x$) can be mapped to some higher-dimensional feature space ($\phi(x)$) where the training set is separable.
For instance, we can map $(x) \to (x, x^2)$ (from 1D to 2D) and $(x_1, x_2) \to (x_1^2, x_2^2, \sqrt{2} x_1 x_2)$ (from 2D to 3D):

– weigh each training example by how incorrectly it was classified
– learn a hypothesis $h_t$ and its strength $\alpha_t$

• Final classifier: a linear combination of the votes of the different classifiers weighted by their strength

Note the notion of a weighted dataset: if $D(i)$ is the weight of the $i$-th training example $(x^{(i)}, y^{(i)})$, then it counts as $D(i)$ examples. From now on, in all calculations, the $i$-th training example counts as $D(i)$ "examples".
With this in mind, we can now define the full boosting algorithm (Schapire, 1998):

Algorithm 17: (AdaBoost)
Given dataset $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$ where $x^{(i)} \in X$, $y^{(i)} \in Y = \{\pm 1\}$.
Initialize $D_1(i) = \frac{1}{m}$.
For $t = 1, \dots, T$:

• Train weak learner using distribution $D_t$.

• Get weak classifier $h_t : X \to \mathbb{R}$.

• Compute the error $\epsilon_t = \sum_{i=1}^{m} D_t(i) \, I\left(h_t(x^{(i)}) \ne y^{(i)}\right)$ and strength $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.

• Update
$$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y^{(i)} h_t(x^{(i)}))}{Z_t}$$
where
$$Z_t = \sum_{i=1}^{m} D_t(i) \exp(-\alpha_t y^{(i)} h_t(x^{(i)}))$$
is a normalization factor (chosen so that $\sum_{i=1}^{m} D_{t+1}(i) = 1$).

Output the final classifier $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

An example is shown below.
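A minimal sketch of Algorithm 17 (AdaBoost) in NumPy follows. The weak learner here is an exhaustive one-feature decision stump, which is an assumption made for illustration (the notes do not fix a weak learner), and the small clipping of $\epsilon_t$ is a numerical safeguard that is not part of the algorithm. The toy data, a 1D "interval" problem that no single stump can solve, is also made up.

```python
import numpy as np

def stump_predict(X, feature, threshold, sign):
    """Decision stump in {-1, +1}: `sign` at or below the threshold, `-sign` above it."""
    return np.where(X[:, feature] <= threshold, sign, -sign)

def fit_stump(X, y, D):
    """Exhaustively pick the stump with the smallest weighted error under D."""
    best_err, best_params = np.inf, None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for sign in (1, -1):
                err = D[stump_predict(X, feature, threshold, sign) != y].sum()
                if err < best_err:
                    best_err, best_params = err, (feature, threshold, sign)
    return best_err, best_params

def adaboost(X, y, T):
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        eps, params = fit_stump(X, y, D)         # weighted error eps_t
        eps = np.clip(eps, 1e-12, 1 - 1e-12)     # guard against log(0); not in Algorithm 17
        alpha = 0.5 * np.log((1 - eps) / eps)    # strength alpha_t
        D = D * np.exp(-alpha * y * stump_predict(X, *params))
        D /= D.sum()                             # division by the normalizer Z_t
        ensemble.append((alpha, params))
    return ensemble

def predict(ensemble, X):
    f = sum(alpha * stump_predict(X, *params) for alpha, params in ensemble)
    return np.sign(f)                            # H(x) = sign(sum_t alpha_t h_t(x))

# Toy data: label +1 inside (0.3, 0.7), -1 outside.
X = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y = np.where((X[:, 0] > 0.3) & (X[:, 0] < 0.7), 1, -1)
one_round_acc = np.mean(predict(adaboost(X, y, T=1), X) == y)
three_round_acc = np.mean(predict(adaboost(X, y, T=3), X) == y)
print(one_round_acc, three_round_acc)
```

On this toy problem a single stump cannot separate the interval from both sides, while the weighted vote of three stumps does, which is the behaviour the reweighting scheme is designed to produce.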
Chapter 9

Principal Component Analysis

9.1 Introduction
Suppose we are given data points in $d$-dimensional space and want to project them into a lower dimensional space while preserving as much information as possible (e.g., find the best planar approximation to 3D data or $10^4$-D data). In particular, choose an orthogonal projection that minimizes the squared error in reconstructing the original data.
Like auto-encoding neural networks, PCA learns a re-representation of input data that can best reconstruct it. However, PCA has some differences:

• The learned encoding is a linear function of inputs

• No local minimum problems when training

• Given $d$-dimensional data $X$, learns a $d$-dimensional representation where:
– the dimensions are orthogonal
– the top $k$ dimensions are the $k$-dimensional linear re-representation that minimizes reconstruction error (sum of squared errors)

In particular, PCA involves orthogonal projection of the data onto a lower-dimensional linear space that equivalently:

1. minimizes the mean squared distance between data points and projections

2. maximizes variance of projected data

PCA has the following properties:

• PCA vectors originate from the center of mass (usually we center the data as the first step)

• Principal component #1: points in the direction of the largest variance

• Each subsequent principal component:
– is orthogonal to the previous ones, and
– points in the direction of the largest variance of the residual subspace

Chapter 10

Hidden Markov Model

10.1 Introduction

Definition 11: (Hidden Markov Model)
A Hidden Markov Model consists of the following:

• A set of states $S = \{s_1, s_2, \dots, s_n\}$. At each time $t$ we are in exactly one of these states, denoted by $q_t$.

• A list $\pi_1, \pi_2, \dots, \pi_n$ where $\pi_i$ is the probability that we start at state $i$.

• A transition probability matrix $A_{j,i} = P(q_t = s_i \mid q_{t-1} = s_j)$, which denotes the probability of transitioning from $s_j$ to $s_i$.

• A set of possible outputs $\Sigma$; at each time $t$ we emit a symbol $\sigma_t \in \Sigma$.

• An emission probability matrix $B_{t,i} = P(o_t \mid s_i)$, which denotes the probability of emitting symbol $\sigma_t$ at state $s_i$.

For example, a two-state HMM may look like the following:

Figure 10.1: An example Hidden Markov Model with 2 states.

Note the Markov property from the above definition: given $q_t$, $q_{t+1}$ is conditionally independent of $q_{t-1}$ or any earlier time point. In other words, knowing $q_t$ is sufficient to infer about the state of $q_{t+1}$.
With $n$ states and $m$ output symbols, we can see that there are $n$ starting parameters $\pi_i$, $n^2$ transition probabilities $A_{ji}$, and $mn$ emission probabilities $B_{ik}$, for a total of $n^2 + mn + n = O(n^2 + mn)$ parameters. We will discuss how to learn these parameters from data later on, but let's first focus on the inference task. Assuming all of these parameters are already known, what kind of information can we infer?

Chapter 11

Reinforcement Learning

11.1 Markov decision process
A Markov decision process is a set of nodes and edges where nodes represent states and edges represent transitions. In this case, the transitions are only based on the previous state (same as in HMM). Unlike HMM, however:

• We know all the states, each state associated with a reward.

• We can have an influence on the transition.

An obvious question for such models: what is the combined expected value for each state? What can we expect to earn over our lifetime if we become asst. prof / go to industry? Before we answer this question, we need to define a model for future rewards. In particular, the value of a current reward is higher than the value of future rewards (inflation, confidence). This discounted reward model is specified using a parameter $0 < \gamma < 1$. Therefore, if we let $r_t$ be the reward at time $t$, then the total reward is
$$\text{Total} = \sum_{t=0}^{\infty} \gamma^t r_t, \quad (11.1)$$

9.2 PCA algorithms
Here we present 3 different algorithms for performing PCA.

Algorithm 18: (Sequential PCA)
Given centered data $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, compute the principal vectors:
$$w_1 = \arg\max_{\|w\| = 1} \frac{1}{m} \sum_{i=1}^{m} (w^T x^{(i)})^2$$
$$w_2 = \arg\max_{\|w\| = 1} \frac{1}{m} \sum_{i=1}^{m} \left(w^T (x^{(i)} - w_1 w_1^T x^{(i)})\right)^2$$
$$\dots$$
$$w_k = \arg\max_{\|w\| = 1} \frac{1}{m} \sum_{i=1}^{m} \left(w^T \Big(x^{(i)} - \sum_{j=1}^{k-1} w_j w_j^T x^{(i)}\Big)\right)^2$$
In the Sequential algorithm, to find $w_1$, we maximize the variance of the projection of $x$. To find $w_2$, we maximize the variance of the projection in the residual subspace.

10.2 Inference in HMM
There are three big questions that a learned HMM can answer:

1. What is $P(q_t = s_i)$? In other words, what is the probability that we end up in state $s_i$ at time $t$, without observing any output?

2. What is $P(q_t = s_i \mid o_1 o_2 \dots o_t)$? In other words, given a sequence of output symbols $o_1 o_2 \dots o_t$, what is the probability that we end up in state $s_i$ at time $t$?

3. What is $\arg\max_{q_1 q_2 \dots q_t} P(q_1 q_2 \dots q_t \mid o_1 o_2 \dots o_t)$? In other words, given a sequence of output symbols $o_1 o_2 \dots o_t$, what is the sequence of states $q_1 q_2 \dots q_t$ that is most likely to generate this output?

Before moving on, we show an example application of these questions. A popular technique of modeling student knowledge in Educational Data Mining is called Bayesian Knowledge Tracing. The high-level goal is: given data about the correctness of a student's answers on a set of problems related to a certain skill, can we say whether the student has mastered this skill?

The total reward in (11.1) does converge because $\gamma \in (0, 1)$ (assuming the rewards are bounded).

Now, let's define $J^*(s_i)$ as the expected discounted sum of rewards when starting at state $s_i$. It follows that
$$J^*(s_i) = r_i + \gamma \sum_{k=1}^{n} p_{ik} J^*(s_k), \quad (11.2)$$
where $p_{ik}$ is the transition probability from $s_i$ to $s_k$ and the sum represents the expected pay for all possible transitions from $s_i$.

We have $n$ equations like (11.2), one for each $s_i$, so this is a linear system of equations and a closed form solution can be derived, but it may be time consuming. It also doesn't generalize to non-linear models. Alternatively, this problem can be solved in an iterative manner: define $J^t(s_i)$ as the expected discounted reward after $t$ steps, then $J^1(s_i) = r_i$ and
$$J^{t+1}(s_i) = r_i + \gamma \sum_{k} p_{ik} J^t(s_k). \quad (11.3)$$
This can be computed via dynamic programming, and we can stop when $\max_i |J^{t+1}(s_i) - J^t(s_i)| < \epsilon$ for some threshold $\epsilon$, which we know will happen because $J^t(s_i)$ converges.
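The two ways of computing $J^*$ described above, solving the linear system (11.2) and iterating (11.3) until the largest change falls below a threshold, can be compared side by side. This is a sketch under assumptions: the 3-state reward vector `r`, transition matrix `P` and discount `gamma` below are invented for illustration.

```python
import numpy as np

def evaluate_closed_form(r, P, gamma):
    """Solve J = r + gamma * P J exactly, i.e. (I - gamma * P) J = r, as in (11.2)."""
    n = len(r)
    return np.linalg.solve(np.eye(n) - gamma * P, r)

def evaluate_iteratively(r, P, gamma, tol=1e-10, max_iter=100_000):
    """Iterate (11.3) from J^1 = r until max_i |J^{t+1}(s_i) - J^t(s_i)| < tol."""
    J = r.copy()
    for _ in range(max_iter):
        J_next = r + gamma * P @ J
        if np.max(np.abs(J_next - J)) < tol:
            return J_next
        J = J_next
    return J

# Hypothetical 3-state chain: P[i, k] is the probability of moving from s_i to s_k.
r = np.array([4.0, 0.0, 2.0])
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
gamma = 0.9

J_exact = evaluate_closed_form(r, P, gamma)
J_iter = evaluate_iteratively(r, P, gamma)
print(J_exact, J_iter)
```

The closed form costs a linear solve, roughly $O(n^3)$, while each sweep of the iteration is a matrix-vector product, so the iterative version is usually preferred when $n$ is large.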

The Sequential algorithm is intuitive and gives a sense of what to look for in a principal vector. However, it is slow and not often used in practice unless we only care about the first principal vector.

Algorithm 19: (Sample covariance matrix PCA)
Given data $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, compute the covariance matrix
$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \bar{x})(x^{(i)} - \bar{x})^T$$
where $\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$. The principal vectors are the eigenvectors of $\Sigma$, and larger eigenvalues correspond to more important eigenvectors.

While straightforward, the computation of $\Sigma$ and its eigenvectors is computationally expensive.

In practice, the most useful method is by singular value decomposition (SVD), which avoids explicitly computing $\Sigma$.
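Algorithm 19 and the SVD route mentioned above can be compared numerically. The sketch below assumes the data matrix stores one sample per row (so the covariance is $\frac{1}{m} X_c^T X_c$ for centered $X_c$); that layout, the function names and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_covariance(X, k):
    """Algorithm 19: top-k eigenvectors of the sample covariance matrix (rows of X are samples)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order], eigvals[order]

def pca_svd(X, k):
    """SVD route: decompose the centered data directly, never forming the covariance."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rows of Vt are the principal directions; S**2 / m are the covariance eigenvalues.
    return Vt[:k].T, S[:k] ** 2 / len(X)

rng = np.random.default_rng(0)                   # arbitrary seed
X = rng.normal(size=(200, 4)) * np.array([5.0, 3.0, 2.0, 1.0])

W_cov, var_cov = pca_covariance(X, k=2)
W_svd, var_svd = pca_svd(X, k=2)
# The directions agree up to sign, so compare absolute column-wise dot products.
alignment = np.abs(np.sum(W_cov * W_svd, axis=0))
print(var_cov, var_svd, alignment)
```

Both routes return the same directions (up to sign) and the same variances; the SVD version simply avoids building the $d \times d$ covariance matrix.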
11.2 Reinforcement learning - No action
In reinforcement learning, we use the same Markov model with rewards and actions. But there
are a few differences:

1. We do not assume we know the Markov model


2. We adapt to new observation (online vs offline)

In other words, we want to learn the expected reward but do not know the model, and in this case we do so by learning both the reward and model at the same time (e.g., game playing, robot interacting with environment). Unlike HMM, if we move to a state, we know which state that is (i.e., the states are observed); other than that, however, we don't know the reward at that state or the transition probabilities.

11.2.1 Supervised RL
More formally, we define the scenario in reinforcement learning as follows: we are wandering the world, and at each time point we see a state and a reward. Our goal is to compute the sum of discounted rewards for each state $J^{est}(s_i)$. For example, given the following observations

$(s_1, 4), \; (s_2, 0), \; (s_3, 2), \; (s_2, 2), \; (s_4, 0)$

In general, we have the supervised learning algorithm for RL as follows.

Algorithm 24: (Supervised reinforcement learning)
Observe set of states and rewards $(s(0), r(0)), (s(1), r(1)), \dots, (s(T), r(T))$.
For $t = 0, \dots, T$ compute the discounted sum
$$J(t) = \sum_{i=t}^{T} \gamma^{i-t} r(i).$$
Compute $J^{est}(s_i)$ (mean of $J(t)$ for $t$ such that $s(t) = s_i$):
$$J^{est}(s_i) = \frac{\sum_{t=0}^{T} J(t) \, I(s(t) = s_i)}{\sum_{t=0}^{T} I(s(t) = s_i)}.$$

Here we assume that we observe each state frequently enough and that we have many observations so that the final observations do not have a big impact on our prediction. Each update takes $O(n)$ where $n$ is the number of states, since we are updating vectors containing entries for all states. Space is also $O(n)$. Convergence to $J^*$ can be proven, and the algorithm can be made more efficient by ignoring states for which the discount factor $\gamma^i$ is very low already.
However, the supervised learning approach has two problems:

Algorithm 20: (SVD PCA)
Perform SVD of the centered data matrix $X = (x^{(1)}, \dots, x^{(m)})$ into
$$X = U S V^T$$
where $U$ and $V$ are orthonormal and $S$ is a diagonal matrix. The columns of $V$ are the eigenvectors of $X^T X$, and the diagonal values of $S$, which are the square roots of the corresponding eigenvalues, denote the importance of each eigenvector in descending order. The top $k$ principal vectors are the leftmost $k$ columns of $V$ (equivalently, the first $k$ rows of $V^T$).

Formally, the SVD of a matrix $A$ is just the decomposition of $A$ into three other matrices, which we call $U$, $S$, and $V$. The dimensions of these matrices are given as subscripts in the formula below:
$$A_{n \times m} = U_{n \times n} \, S_{n \times m} \, V^T_{m \times m}.$$
The columns of $U$ are orthonormal eigenvectors of $A A^T$. The columns of $V$ are orthonormal eigenvectors of $A^T A$. The matrix $S$ is diagonal, with the square roots of the eigenvalues from $U$ (or $V$; the nonzero eigenvalues of $A^T A$ are the same as those of $A A^T$) in descending order. These square roots are called the singular values of $A$.

9.3 PCA applications
The main applications of PCA are:

• Data visualization - by reducing the number of dimensions to 2 or 3, we can plot the data points on a graph

• Noise reduction, e.g. eigenfaces

• Data compression

9.3.1 Eigenfaces
We want to identify a specific person based on a facial image, in a way that is robust to glasses, lighting, facial expression, ... (i.e., they are considered noise in this case). Each image is 256 × 256 pixels so each input $x$ is $256^2 = 65536$ dimensional.
Since the number of dimensions is too large, we cannot perform classification directly. Instead, we use PCA on the whole dataset to get "principal component" images (the eigenfaces), then classify based on projection weights onto these principal component images.

Definition 12: (Bayesian Knowledge Tracing)
For each skill, BKT models student knowledge as a binary latent variable in an HMM, which consists of the following:

• Two states: $S = \{\text{Mastered}, \text{Unmastered}\}$.

• $\pi_{\text{Mastered}}$: the probability of the student starting at Mastered, i.e., knowing the skill beforehand.

• $p_{\text{Learn}}$: the probability of transitioning from Unmastered to Mastered.

• $p_{\text{Forget}}$: the probability of transitioning from Mastered to Unmastered.

• Two possible outputs C (correct) and I (incorrect): whether the student's answer to a problem is correct or incorrect.

• $p_{\text{Guess}} = p(o_t = C \mid q_t = \text{Unmastered})$: the probability of getting a correct answer despite not mastering the skill, i.e., guessing.

• $p_{\text{Slip}} = p(o_t = I \mid q_t = \text{Mastered})$: the probability of getting an incorrect answer despite mastering the skill, i.e., slipping.

[Figure: BKT state diagram. Mastered stays Mastered with probability $1 - p_{\text{Forget}}$ and moves to Unmastered with $p_{\text{Forget}}$; Unmastered stays Unmastered with probability $1 - p_{\text{Learn}}$ and moves to Mastered with $p_{\text{Learn}}$. Mastered emits C with probability $1 - p_{\text{Slip}}$ and I with $p_{\text{Slip}}$; Unmastered emits C with probability $p_{\text{Guess}}$ and I with $1 - p_{\text{Guess}}$.]

Using Inference #2, we can ask questions like: if the student submits 5 answers and gets the first 3 correct but last 2 incorrect, what is the probability that she has mastered the skill? More formally, what is $P(\text{Mastered} \mid CCCII)$? Is this different from, for example, $P(\text{Mastered} \mid IICCC)$ (getting first 2 incorrect but last 3 correct)?

10.2.1 What is $P(q_t = s_i)$?
Since we don't have observed data, the emission probabilities can be ignored. Instead, we simply rely on the priors and transition probabilities. For example, in Figure 10.1 we can compute $P(q_2 = A)$ as
$$P(q_2 = A) = P(q_2 = A \mid q_1 = A) \cdot P(q_1 = A) + P(q_2 = A \mid q_1 = B) \cdot P(q_1 = B)$$
$$= A_{AA} \cdot \pi_A + A_{BA} \cdot \pi_B,$$

and then $P(q_3 = B)$ as
$$P(q_3 = B) = P(q_3 = B \mid q_2 = A) \cdot P(q_2 = A) + P(q_3 = B \mid q_2 = B) \cdot P(q_2 = B)$$
$$= A_{AB} \cdot (A_{AA} \cdot \pi_A + A_{BA} \cdot \pi_B) + A_{BB} \cdot (A_{AB} \cdot \pi_A + A_{BB} \cdot \pi_B).$$
In general,
$$P(q_t = s_i) = \sum_{q_1, q_2, \dots, q_{t-1} \in S} P(s_i \mid q_1 q_2 \dots q_{t-1}) \, P(q_1 q_2 \dots q_{t-1}). \quad (10.1)$$
However, this is too costly to compute, since the sum ranges over all $n^{t-1}$ state sequences, giving runtime $O(n^{t-1})$. Instead, an optimization trick is to use dynamic programming:

Algorithm 21: (Computing final state without observations)
We perform two steps:

• Base case: $P(q_1 = s_i) = \pi_i$

• Inductive case:
$$P(q_{t+1} = s_i) = \sum_{j \in S} P(q_{t+1} = s_i \mid q_t = s_j) \cdot P(q_t = s_j) = \sum_{j \in S} A_{ji} \cdot P(q_t = s_j). \quad (10.2)$$

10.2.2 What is $P(q_t = s_i \mid o_1 o_2 \dots o_t)$?
Using the definition of conditional probability, we first see that
$$P(q_t = s_i \mid o_1 o_2 \dots o_t) = \frac{P(q_t = s_i \wedge o_1 o_2 \dots o_t)}{P(o_1 o_2 \dots o_t)} = \frac{P(q_t = s_i \wedge o_1 o_2 \dots o_t)}{\sum_{j \in S} P(q_t = s_j \wedge o_1 o_2 \dots o_t)} \quad (10.3)$$
This motivates us to define $\alpha_t(i) = P(q_t = s_i \wedge o_1 o_2 \dots o_t)$. We can then compute $\alpha_t(i)$ as follows.

Algorithm 22: (Computing final state given observed output sequence)
To compute $\alpha_t(i)$, we perform two steps:

• Base case: $\alpha_1(i) = P(q_1 = s_i \wedge o_1) = P(o_1 \mid q_1 = s_i) P(q_1 = s_i) = B_{1,i} \cdot \pi_i$.

• Inductive case:
$$\alpha_{t+1}(i) = \sum_{j \in S} \alpha_t(j) \cdot A_{j,i} \cdot B_{t+1,i}. \quad (10.4)$$

It follows that
$$P(q_t = s_i \mid o_1 o_2 \dots o_t) = \frac{\alpha_t(i)}{\sum_{j \in S} \alpha_t(j)} \quad (10.5)$$

10.2.3 What is $\arg\max_{q_1 q_2 \dots q_t} P(q_1 q_2 \dots q_t \mid o_1 o_2 \dots o_t)$?
Let
$$\delta_t(i) = \max_{q_1, \dots, q_{t-1} \in S} P(q_1 \dots q_{t-1} \wedge q_t = s_i \wedge o_1 \dots o_t). \quad (10.6)$$
In other words, $\delta_t(i)$ is the probability of the most likely path from time 1 to $t$ that produces output $o_1 \dots o_t$ and ends in $s_i$. We can then compute $\delta_t(i)$ as follows.

Suppose there are $m$ instances each with dimension $N$ (in this case $m = 50$, $N = 65536$). Given an $N \times N$ covariance matrix $\Sigma$, we can compute:

• All $N$ eigenvectors / eigenvalues in $O(N^3)$

• First $k$ eigenvectors / eigenvalues in $O(kN^2)$

But this is expensive if $N = 65536$. However, there is a clever workaround, since we note that $m \ll 65536$. Specifically, we first compute the eigenvectors $v$ of $L = X^T X$ (which is much smaller, only $m \times m$); then for each $v$, $Xv$ is an eigenvector of $X X^T$.

Diving in the Math 14 - Proof of workaround for eigenfaces
We want to prove that if $v$ is an eigenvector of $L = X^T X$ then $Xv$ is an eigenvector of $\Sigma = X X^T$.
Based on the definition of eigenvector, there exists $\gamma$ such that
$$Lv = \gamma v$$
$$X^T X v = \gamma v$$
$$X (X^T X v) = X (\gamma v) = \gamma X v$$
$$(X X^T) X v = \gamma (X v)$$
$$\Sigma (X v) = \gamma (X v)$$
Again, using the definition of eigenvector, we see that $Xv$ is an eigenvector of $\Sigma$, also with eigenvalue $\gamma$.

In other words, we do not have to compute the eigenvectors of $\Sigma$ directly from $\Sigma$, but through $L$. This would reduce the runtime to $O(N m^2) + O(k m^2)$, where $k$ is the specified number of eigenvectors (i.e., the number of dimensions we want to reduce to).
We can then reconstruct the faces using some of the top principal vectors. As more eigenvectors are used, we get back more detailed faces but without noise such as lighting, glasses and facial expression. The figures below demonstrate how we can reconstruct one particular face, starting from using only one principal vector, then adding more and more. The circled face denotes the best approximation without noise.

• Takes a long time to converge, because we don't try to learn the underlying MDP model, but just focus on $J^{est}$.

• Does not use all available data; we can learn transition probabilities as well.

In other words, we want to utilize the fact that there is an underlying model and the transitions are not completely random.

11.2.2 Certainty-Equivalence learning

Algorithm 25: (Certainty-Equivalence (CE) learning)
We keep track of 3 arrays:

• $Counts(s)$: number of times we visited state $s$

• $J(s)$: sum of rewards from state $s$

• $Trans(i, j)$: number of times we transitioned from $s_i$ to $s_j$

When we visit state $s_i$, receive reward $r$ and move to state $s_j$ we do the following:

• $Counts(s_i) = Counts(s_i) + 1$

• $J(s_i) = J(s_i) + r$

• $Trans(i, j) = Trans(i, j) + 1$

At any time, we can estimate:

• Reward estimate $r^{est}(s) = J(s) / Counts(s)$

• Transition probability estimate $p^{est}(s_j \mid s_i) = Trans(i, j) / Counts(s_i)$

After learning the model, we now have an estimate which we can solve for all states $s_i$:
$$J^{est}(s_i) = r^{est}(s_i) + \gamma \sum_j p^{est}(s_j \mid s_i) \, J^{est}(s_j), \quad i = 1, \dots, n \quad (11.4)$$
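The bookkeeping of Algorithm 25 and the linear solve in (11.4) can be sketched as follows. The class name, the handling of unvisited states (their estimates are left at 0) and the deterministic 3-state cycle used as input are assumptions for illustration only.

```python
import numpy as np

class CertaintyEquivalence:
    """Tracks Counts, J (reward sums) and Trans, then solves (11.4) on demand."""

    def __init__(self, n_states, gamma):
        self.gamma = gamma
        self.counts = np.zeros(n_states)
        self.reward_sum = np.zeros(n_states)
        self.trans = np.zeros((n_states, n_states))

    def update(self, i, r, j):
        """Visited state i, received reward r, moved to state j: O(1) work."""
        self.counts[i] += 1
        self.reward_sum[i] += r
        self.trans[i, j] += 1

    def solve(self):
        """Solve J = r_est + gamma * p_est J, an O(n^3) linear solve."""
        visited = np.maximum(self.counts, 1)          # avoid 0/0 for unvisited states (assumption)
        r_est = self.reward_sum / visited
        p_est = self.trans / visited[:, None]
        n = len(self.counts)
        return np.linalg.solve(np.eye(n) - self.gamma * p_est, r_est)

# Hypothetical experience: a deterministic cycle s0 -> s1 -> s2 -> s0 with fixed rewards.
ce = CertaintyEquivalence(n_states=3, gamma=0.5)
rewards = [1.0, 0.0, 2.0]
state = 0
for _ in range(30):
    nxt = (state + 1) % 3
    ce.update(state, rewards[state], nxt)
    state = nxt

J_est = ce.solve()
print(J_est)
```

The split between a cheap `update` and an expensive `solve` mirrors the cost discussion for CE: the model is learned incrementally, and the $n$ equations are only solved when an estimate is needed.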

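Algorithm 23: (Viterbi Algorithm)

We perform two steps:

• Base case: $\delta_1(i) = \pi_i \cdot B_{1,i}$.

• Inductive case:
$$\delta_{t+1}(i) = \max_j \delta_t(j) \cdot A_{j,i} \cdot B_{t+1,i}. \quad (10.7)$$

It follows that $\arg\max_{q_1 q_2 \dots q_t} P(q_1 q_2 \dots q_t \mid o_1 o_2 \dots o_t)$ is the path that ends in state $\arg\max_j \delta_t(j)$, recovered by tracing back the maximizing $j$ chosen at each step.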
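Algorithms 22 and 23 share the same recursion, with the sum over previous states replaced by a max. The sketch below implements both for a small HMM; the array layout (`A[j, i]` for the transition from $s_j$ to $s_i$, `B[i, o]` for emitting symbol `o` in state $s_i$) and the BKT-style parameter values are assumptions chosen for the example, not values from the notes.

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(q_t = s_i and o_1..o_t), following (10.4)."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def filtered_posterior(pi, A, B, obs):
    """P(q_t = s_i | o_1..o_t) for the final t, following (10.5)."""
    last = forward(pi, A, B, obs)[-1]
    return last / last.sum()

def viterbi(pi, A, B, obs):
    """Most likely state path and its joint probability, following (10.7)."""
    T, n = len(obs), len(pi)
    delta = np.zeros((T, n))
    back = np.zeros((T, n), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A       # scores[j, i] = delta_t(j) * A[j, i]
        back[t] = scores.argmax(axis=0)          # best predecessor j for each state i
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                # trace the stored predecessors backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

# Hypothetical BKT parameters: state 0 = Mastered, 1 = Unmastered; symbol 0 = C, 1 = I.
pi = np.array([0.3, 0.7])
A = np.array([[0.95, 0.05],     # pForget = 0.05
              [0.20, 0.80]])    # pLearn  = 0.20
B = np.array([[0.90, 0.10],     # pSlip   = 0.10
              [0.25, 0.75]])    # pGuess  = 0.25

p_cccii = filtered_posterior(pi, A, B, [0, 0, 0, 1, 1])[0]
p_iiccc = filtered_posterior(pi, A, B, [1, 1, 0, 0, 0])[0]
best_path, best_prob = viterbi(pi, A, B, [1, 1, 0, 0, 0])
print(p_cccii, p_iiccc, best_path, best_prob)
```

With these made-up parameters, recent answers weigh more heavily than early ones, so the mastery estimate after IICCC comes out higher than after CCCII.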

Note: the eigenfaces method is quite old. Nowadays we use deep neural networks.

9.3.2 Image compression
To compress an image, we can divide it into patches (12 × 12 pixels on a grid), so each patch is a 144-D vector input. Using PCA on these inputs, reconstructing the patches, then putting them back together again will give us the compressed version. In some cases, using only 13 principal vectors can already reduce the relative error to 5% (i.e., most information is in the top principal vectors).

9.4 Shortcomings
PCA is unsupervised and doesn't care about the labels. It maximizes the variance, independent of class. For example, in the plot below, if we want to reduce the dimension to 1 while preserving class separations, we would pick the green line. However, PCA would pick the magenta line instead.

Furthermore, PCA can only capture linear relationships.

The runtime of CE comes from two steps: update ($O(1)$) and solving the MDP ($O(n^3)$ using matrix inversion). The space is $O(n^2)$ for transition probabilities.
To reduce runtime, we could use the "One backup" version, which updates $J^{est}(s_i)$ for the current state $s_i$ while learning the model, instead of solving $n$ equations after learning like in (11.4). In this case, the runtime is only $O(n)$, and we can still prove convergence to $J^*$ (but slower than CE). The space remains at $O(n^2)$.

11.2.3 Temporal difference learning
We now look at another algorithm with the same efficiency as one backup CE but which requires much less space. In particular, we can ignore all the $r^{est}$ and $p^{est}$ and only focus on $J^{est}$ with a new approximation rule.

Algorithm 26: (Temporal difference (TD) learning)
We only maintain the $J^{est}$ array. Assume we have $J^{est}(s_1), \dots, J^{est}(s_n)$. If we observe a transition from state $s_i$ to state $s_j$ and a reward $r$, we update using the following rule
$$J^{est}(s_i) = (1 - \alpha) J^{est}(s_i) + \alpha (r + \gamma J^{est}(s_j)), \quad (11.5)$$
where $\alpha$ is a hyper-parameter to determine how much weight we place on the current observation (and can change during the algorithm, unlike $\gamma$).

As always, choosing a good $\alpha$ is an issue. Nevertheless, it can be proven that TD learning is guaranteed to converge if:

• All states are visited often.

• $\sum_t \alpha_t = \infty$

• $\sum_t \alpha_t^2 < \infty$
For example, αt = Ct for some constant t would satisfy both requirements. 6 PART III: PROBABILITY AND THE FOUNDATIONS OF INFERENTIAL STATISTICS
Now the runtime of TD is O(1) because there is only one update (11.5) at each iteration, and
the space is O(n) because of the J est array.
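The TD rule (11.5) can be sketched as follows. This is a minimal illustration under stated assumptions: states are integer indices, the experience is a hypothetical list of `(s_i, r, s_j)` transitions, and the step size is $\alpha_t = \min(1, C/t)$, one schedule that meets both summation conditions. Each update touches a single array entry, which is where the $O(1)$ time and $O(n)$ space come from.

```python
def td_update(J, s_i, s_j, r, alpha, gamma):
    """Apply update (11.5) in place to the value estimates J."""
    J[s_i] = (1 - alpha) * J[s_i] + alpha * (r + gamma * J[s_j])


def td_learn(transitions, n_states, gamma=0.9, C=1.0):
    """Run TD learning over observed (s_i, r, s_j) transitions.

    The step size alpha_t = min(1, C / t) is an illustrative choice:
    its sum diverges while the sum of its squares stays finite.
    """
    J = [0.0] * n_states          # the only state we keep: O(n) space
    for t, (s_i, r, s_j) in enumerate(transitions, start=1):
        alpha = min(1.0, C / t)
        td_update(J, s_i, s_j, r, alpha, gamma)  # O(1) work per observation
    return J


# Hypothetical two-state chain: state 0 moves to state 1 with reward 1,
# state 1 loops on itself with reward 0.
experience = [(0, 1.0, 1), (1, 0.0, 1)] * 50
estimates = td_learn(experience, n_states=2)
```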
PART III: PROBABILITY AND THE FOUNDATIONS OF INFERENTIAL STATISTICS

CHAPTER 8
Introduction to Hypothesis Testing

8.1 Inferential Statistics and Hypothesis Testing
8.2 Four Steps to Hypothesis Testing
8.3 Hypothesis Testing and Sampling Distributions
8.4 Making a Decision: Types of Error
8.5 Testing a Research Hypothesis: Examples Using the z Test
8.6 Research in Focus: Directional Versus Nondirectional Tests
8.7 Measuring the Size of an Effect: Cohen's d
8.8 Effect Size, Power, and Sample Size
8.9 Additional Factors That Increase Power
8.10 SPSS in Focus: A Preview for Chapters 9 to 18
8.11 APA in Focus: Reporting the Test Statistic and Effect Size

LEARNING OBJECTIVES

After reading this chapter, you should be able to:

1 Identify the four steps of hypothesis testing.
2 Define null hypothesis, alternative hypothesis, level of significance, test statistic, p value, and statistical significance.
3 Define Type I error and Type II error, and identify the type of error that researchers control.
4 Calculate the one-independent sample z test and interpret the results.
5 Distinguish between a one-tailed and two-tailed test, and explain why a Type III error is possible only with one-tailed tests.
6 Explain what effect size measures and compute a Cohen's d for the one-independent sample z test.
7 Define power and identify six factors that influence power.
8 Summarize the results of a one-independent sample z test in American Psychological Association (APA) format.

8.1 INFERENTIAL STATISTICS AND HYPOTHESIS TESTING

We use inferential statistics because it allows us to measure behavior in samples to
learn more about the behavior in populations that are often too large or inaccessible.
We use samples because we know how they are related to populations. For
example, suppose the average score on a standardized exam in a given population is
1,000. In Chapter 7, we showed that the sample mean is an unbiased estimator of
the population mean: if we selected a random sample from a population, then on
average the value of the sample mean will equal the population mean. In our example,
if we select a random sample from this population with a mean of 1,000, then
on average, the value of a sample mean will equal 1,000. On the basis of the central
limit theorem, we know that the probability of selecting any other sample mean
value from this population is normally distributed.
In behavioral research, we select samples to learn more about populations of
interest to us. In terms of the mean, we measure a sample mean to learn more about
the mean in a population. Therefore, we will use the sample mean to describe the
population mean. We begin by stating the value of a population mean, and then we
select a sample and measure the mean in that sample. On average, the value of the
sample mean will equal the population mean. The larger the difference or discrepancy
between the sample mean and population mean, the less likely it is that we
could have selected that sample mean, if the value of the population mean is correct.
This type of experimental situation, using the example of standardized exam
scores, is illustrated in Figure 8.1.

FIGURE 8.1
The sampling distribution for a population with a mean equal to 1,000.
If 1,000 is the correct population mean, then we know that, on
average, the sample mean will equal 1,000 (the population mean).
Using the empirical rule, we know that about 95% of all samples
selected from this population will have a sample mean that falls
within two standard deviations (SD) of the mean. It is therefore
unlikely (less than a 5% probability) that we will measure a
sample mean beyond 2 SD from the population mean, if
the population mean is indeed correct.
(Figure labels: µ = 1000; "We expect the sample mean to be equal to the population mean.")

The method in which we select samples to learn more about characteristics in
a given population is called hypothesis testing. Hypothesis testing is really a
systematic way to test claims or ideas about a group or population. To illustrate,

FIGURE 8.2
The alternative hypothesis determines whether to place
the level of significance in one or both tails of a sampling
distribution. Sample means that fall in the tails are
unlikely to occur (less than a 5% probability) if the value
stated for a population mean in the null hypothesis is true.
(Figure panels, each centered at µ = 3: "H1: Children watch less than 3 hours of TV per week."; "H1: Children watch more than 3 hours of TV per week."; "H1: Children do not watch 3 hours of TV per week." Label: "We expect the sample mean to be equal to the population mean.")

sample mean that is beyond 2 SD from the population mean. For the children
watching TV example, we can look for the probability of obtaining a sample mean
beyond 2 SD in the upper tail (greater than 3), the lower tail (less than 3), or both
tails (not equal to 3). Figure 8.2 shows that the alternative hypothesis is used to
determine which tail or tails to place the level of significance for a hypothesis test.

NOTE: The level of significance in hypothesis testing is the criterion we
use to decide whether the value stated in the null hypothesis is likely to be true.

Step 3: Compute the test statistic. Suppose we measure a sample mean equal to
4 hours per week that children watch TV. To make a decision, we need to evaluate
how likely this sample outcome is, if the population mean stated by the null
hypothesis (3 hours per week) is true. We use a test statistic to determine this
likelihood. Specifically, a test statistic tells us how far, or how many standard
deviations, a sample mean is from the population mean. The larger the value of the
test statistic, the further the distance, or number of standard deviations, a sample
mean is from the population mean stated in the null hypothesis. The value of the
test statistic is used to make a decision in Step 4.

DEFINITION
The test statistic is a mathematical formula that allows researchers to
determine the likelihood of obtaining sample outcomes if the null hypothesis
were true. The value of the test statistic is used to make a decision regarding
the null hypothesis.

NOTE: We use the value of the test statistic to make a decision
regarding the null hypothesis.

Step 4: Make a decision. We use the value of the test statistic to make a decision
about the null hypothesis. The decision is based on the probability of obtaining a
sample mean, given that the value stated in the null hypothesis is true. If the
probability of obtaining a sample mean is less than 5% when the null hypothesis is
true, then the decision is to reject the null hypothesis. If the probability of obtaining
a sample mean is greater than 5% when the null hypothesis is true, then the
decision is to retain the null hypothesis. In sum, there are two decisions a researcher
can make:

1. Reject the null hypothesis. The sample mean is associated with a low
probability of occurrence when the null hypothesis is true.

2. Retain the null hypothesis. The sample mean is associated with a high
probability of occurrence when the null hypothesis is true.

The probability of obtaining a sample mean, given that the value stated in the
null hypothesis is true, is stated by the p value. The p value is a probability: It varies
between 0 and 1 and can never be negative. In Step 2, we stated the criterion or
probability of obtaining a sample mean at which point we will decide to reject the
value stated in the null hypothesis, which is typically set at 5% in behavioral research.
To make a decision, we compare the p value to the criterion we set in Step 2.

DEFINITION
A p value is the probability of obtaining a sample outcome, given that the
value stated in the null hypothesis is true. The p value for obtaining a sample
outcome is compared to the level of significance.

Significance, or statistical significance, describes a decision made concerning a
value stated in the null hypothesis. When the null hypothesis is rejected, we reach
significance. When the null hypothesis is retained, we fail to reach significance.

When the p value is less than 5% (p < .05), we reject the null hypothesis. We will
refer to p < .05 as the criterion for deciding to reject the null hypothesis, although
note that when p = .05, the decision is also to reject the null hypothesis. When the
p value is greater than 5% (p > .05), we retain the null hypothesis. The decision to
reject or retain the null hypothesis is called significance. When the p value is less
than .05, we reach significance; the decision is to reject the null hypothesis. When
the p value is greater than .05, we fail to reach significance; the decision is to retain
the null hypothesis. Figure 8.3 shows the four steps of hypothesis testing.

NOTE: Researchers make decisions regarding the null
hypothesis. The decision can be to retain the null (p > .05)
or reject the null (p < .05).

LEARNING CHECK 2

1. State the four steps of hypothesis testing.
2. The decision in hypothesis testing is to retain or reject which hypothesis: the
null or alternative hypothesis?
3. The criterion or level of significance in behavioral research is typically set at
what probability value?
4. A test statistic is associated with a p value less than .05 or 5%. What is the
decision for this hypothesis test?
5. If the null hypothesis is rejected, then did we reach significance?

Answers: 1. Step 1: State the null and alternative hypothesis. Step 2: Determine the level of significance.
Step 3: Compute the test statistic. Step 4: Make a decision; 2. Null; 3. A .05 or 5% likelihood for obtaining a sample
outcome; 4. Reject the null; 5. Yes.

Here is a summary so far of the four reinforcement learning algorithms.

Method                 Time        Space
Supervised learning    $O(n)$      $O(n)$
CE learning            $O(n^3)$    $O(n^2)$
One backup CE          $O(n)$      $O(n^2)$
TD learning            $O(1)$      $O(n)$

11.3 Reinforcement learning with action - Policy learning

So far we assumed that we cannot impact the outcome transition. In real world situations we
often have a choice of actions we take (as we discussed for MDPs). How can we learn the best
policy for such cases?

Note the difference in the $p^{est}$ table - while the columns are still the states (because we only
transition from state to state), the rows are now (state, action) pairs, because each action leads
to a different transition. Our goal is to learn the action that leads to the most reward. In
particular, we can update CE by setting

$$J^{est}(s_i) = r^{est}(s_i) + \max_a \left( \gamma \sum_j p^{est}(s_j \mid s_i, a) J^{est}(s_j) \right). \qquad (11.6)$$

As mentioned above, we can also use TD learning for better efficiency. However, TD is model
free, so in this context, we can adjust TD to learn policies by defining $Q^*(s_i, a)$ = expected sum
of future (discounted) rewards if we start at state $s_i$ and take action a. Then, when we take a
specific action a in state $s_i$ and then transition to state $s_j$ we can update

$$Q^{est}(s_i, a) = (1 - \alpha) Q^{est}(s_i, a) + \alpha \left( r_i + \gamma \max_{a'} Q^{est}(s_j, a') \right). \qquad (11.7)$$

Instead of the $J^{est}$ vector we maintain the $Q^{est}$ matrix, which is a rather sparse n by m matrix
(n states and m actions).
In practice, when choosing the next action, we may not necessarily pick the one that results in
the highest expected sum of future rewards, because we are only sampling from the distribution
of possible outcomes. We do not want to avoid potentially beneficial actions. Instead, we can
take a more probabilistic approach

$$p(a) = \frac{1}{Z} \exp\left( \frac{Q^{est}(s_i, a)}{f(t)} \right), \qquad (11.8)$$

where Z is a normalizing constant and f(t) decreases as time t goes by, to represent that we
are more confident in the learned model, so that actions with a higher $Q^{est}$ are chosen more
and more often. We can initialize Q values to be high to increase the
likelihood that we will explore more options. Finally, it can be shown that Q learning converges
to the optimal policy.
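The Q-learning update (11.7) and the probabilistic action choice can be sketched together. This is a minimal illustration under stated assumptions: `step(s, a)` is a hypothetical environment function returning `(reward, next_state)`, the temperature schedule standing in for f(t) is an arbitrary decreasing choice, and actions are weighted by $\exp(Q^{est}/f(t))$ so that higher-valued actions become more likely as f(t) shrinks.

```python
import math
import random


def boltzmann_probs(q_row, temperature):
    """Action probabilities proportional to exp(Q / temperature)."""
    top = max(q_row)  # subtracting the max avoids overflow in exp
    weights = [math.exp((q - top) / temperature) for q in q_row]
    z = sum(weights)  # the normalizing constant Z
    return [w / z for w in weights]


def q_update(Q, s_i, a, r, s_j, alpha, gamma):
    """Apply update (11.7) in place to the table Q[state][action]."""
    target = r + gamma * max(Q[s_j])
    Q[s_i][a] = (1 - alpha) * Q[s_i][a] + alpha * target


def q_learning(step, n_states, n_actions, episodes=500, horizon=20,
               alpha=0.1, gamma=0.9, seed=0):
    """Learn Q from interaction with a hypothetical step(s, a) function."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for episode in range(episodes):
        # Illustrative stand-in for f(t): decreasing, with a small floor
        temperature = max(0.05, 1.0 / (1 + episode))
        s = 0  # every episode starts in state 0 (an assumption)
        for _ in range(horizon):
            probs = boltzmann_probs(Q[s], temperature)
            a = rng.choices(range(n_actions), weights=probs)[0]
            r, s_next = step(s, a)
            q_update(Q, s, a, r, s_next, alpha, gamma)
            s = s_next
    return Q
```

The greedy policy is then read off as the action with the largest entry in each row of `Q`. For large state spaces the dense table would typically be replaced by a dictionary keyed on visited (state, action) pairs, since most entries are never touched.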

Chapter 12

Generalization and Model Selection

12.1 True risk vs Empirical risk

Definition 13: (True risk)
True risk is the target performance measure. It is defined as the probability of misclassification
$P(f(X) \neq Y)$ in classification and mean squared error $E[(f(X) - Y)^2]$ in regression.
More generally, it is the expected performance on a random test point (X, Y).

While we want to minimize true risk, we do not know what the underlying distribution of X and
Y is. What we do know are the samples $(X_i, Y_i)$, which give us the empirical risk.

Definition 14: (Empirical risk)
Empirical risk is the performance on training data. It is defined as the proportion of misclassified
examples $\frac{1}{n} \sum_{i=1}^{n} I(f(X_i) \neq Y_i)$ in classification and average squared error
$\frac{1}{n} \sum_{i=1}^{n} (f(X_i) - Y_i)^2$ in regression.

So we want to minimize the empirical risk and evaluate the true risk, but this may lead to
overfitting (i.e., small training error but large generalization error). For instance, the following
graph shows two classifiers for a binary classification problem (football player or not). While
the classifier on the right has zero training error, we are much less inclined to believe that it
captures the true distribution. Here it is more likely that football players simply have higher
height and weight, which match better with the classifier on the left.

The question is: when should we not minimize the empirical risk completely? The following
graph shows what the empirical risk and true risk may look like as we increase the model
complexity. Initially both types of risk would decrease, but after some point (the Best Model

suppose we read an article stating that children in the United States watch an average
of 3 hours of TV per week. To test whether this claim is true, we record the time
(in hours) that a group of 20 American children (the sample), among all children in
the United States (the population), watch TV. The mean we measure for these 20
children is a sample mean. We can then compare the sample mean we select to the
population mean stated in the article.

DEFINITION
Hypothesis testing or significance testing is a method for testing a claim or
hypothesis about a parameter in a population, using data measured in a
sample. In this method, we test some hypothesis by determining the
likelihood that a sample statistic could have been selected, if the hypothesis
regarding the population parameter were true.

The method of hypothesis testing can be summarized in four steps. We will
describe each of these four steps in greater detail in Section 8.2.

1. To begin, we identify a hypothesis or claim that we feel should be tested.
For example, we might want to test the claim that the mean number of
hours that children in the United States watch TV is 3 hours.

2. We select a criterion upon which we decide that the claim being tested is
true or not. For example, the claim is that children watch 3 hours of TV per
week. Most samples we select should have a mean close to or equal to
3 hours if the claim we are testing is true. So at what point do we decide that
the discrepancy between the sample mean and 3 is so big that the claim
we are testing is likely not true? We answer this question in this step of
hypothesis testing.

3. Select a random sample from the population and measure the sample mean.
For example, we could select 20 children and measure the mean time (in
hours) that they watch TV per week.

4. Compare what we observe in the sample to what we expect to observe if
the claim we are testing is true. We expect the sample mean to be around
3 hours. If the discrepancy between the sample mean and population mean
is small, then we will likely decide that the claim we are testing is indeed
true. If the discrepancy is too large, then we will likely decide to reject the
claim as being not true.

NOTE: Hypothesis testing is the method of testing whether
claims or hypotheses regarding a population are likely to be true.

LEARNING CHECK 1

1. On average, what do we expect the sample mean to be equal to?
2. True or false: Researchers select a sample from a population to learn more about
characteristics in that sample.

Answers: 1. The population mean; 2. False. Researchers select a sample from a population to learn more about
characteristics in the population that the sample was selected from.

FIGURE 8.3
A summary of hypothesis testing.

STEP 1: State the hypotheses. A researcher states a null hypothesis about a value in the
population (H0) and an alternative hypothesis that contradicts the null hypothesis.

STEP 2: Set the criteria for a decision. A criterion (the level of significance) is set upon
which a researcher will decide whether to retain or reject the value stated in the null
hypothesis.

Conduct a study with a sample selected from a population. A sample is selected from the
population, and a sample mean is measured.

STEP 3: Compute the test statistic. Measure data and compute a test statistic. This will
produce a value that can be compared to the criterion that was set before the sample was
selected.

STEP 4: Make a decision. If the probability of obtaining a sample mean is less than 5%
when the null is true, then reject the null hypothesis. If the probability of obtaining a
sample mean is greater than 5% when the null is true, then retain the null hypothesis.

8.3 HYPOTHESIS TESTING AND SAMPLING DISTRIBUTIONS

The logic of hypothesis testing is rooted in an understanding of the sampling
distribution of the mean. In Chapter 7, we showed three characteristics of the
mean, two of which are particularly relevant in this section:

1. The sample mean is an unbiased estimator of the population mean. On
average, a randomly selected sample will have a mean equal to that in the
population. In hypothesis testing, we begin by stating the null hypothesis.
We expect that, if the null hypothesis is true, then a random sample selected
from a given population will have a sample mean equal to the value stated
in the null hypothesis.

2. Regardless of the distribution in the population, the sampling distribution
of the sample mean is normally distributed. Hence, the probabilities of all
other possible sample means we could select are normally distributed. Using
this distribution, we can therefore state an alternative hypothesis to locate
the probability of obtaining sample means with less than a 5% chance of
being selected if the value stated in the null hypothesis is true. Figure 8.2
shows that we can identify sample mean outcomes in one or both tails.
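The two characteristics of the sample mean used in Section 8.3 can be checked with a short simulation. This is a sketch under stated assumptions: the population is a hypothetical, deliberately skewed set of weekly TV hours with a mean near 3, samples of size 20 are drawn with replacement, and all function names are illustrative. It compares the average of many sample means with the population mean and counts how many sample means fall within two standard errors of it.

```python
import math
import random
import statistics


def simulate_sample_means(population, n, draws, seed=0):
    """Draw `draws` random samples of size n and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(draws)]


def summarize(population, n, draws=2000, seed=0):
    """Return (population mean, mean of sample means, share within 2 SEM)."""
    mu = statistics.fmean(population)
    sem = statistics.pstdev(population) / math.sqrt(n)  # standard error of the mean
    means = simulate_sample_means(population, n, draws, seed)
    within = sum(abs(m - mu) <= 2 * sem for m in means) / draws
    return mu, statistics.fmean(means), within


# Hypothetical skewed population of weekly TV hours (mean close to 3)
_rng = random.Random(1)
tv_hours = [_rng.expovariate(1 / 3) for _ in range(10_000)]
mu, mean_of_means, share_within_2_sem = summarize(tv_hours, n=20)
```

With a skewed population and a sample size of 20, the share within two standard errors is typically close to, though not exactly, 95%; the normal shape is an approximation that improves as the sample size grows.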

point), we start fitting the noise instead of the true data. In that case, the empirical risk can
keep decreasing while the true risk increases.

Again, we do not know the true risk in practice, which makes this a difficult problem. Can we
estimate the true risk in a way better than just using the empirical risk? One way is to use
structural risk minimization.

Definition 15: (Structural risk minimization)
Penalize models using a bound on the deviation of true and empirical risk

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \{ \hat{R}_n(f) + \lambda C(f) \}, \qquad (12.1)$$

where λ is a tuning parameter chosen by model selection, and C(f) is the bound on deviation
from true risk (a). In essence, instead of minimizing the unknown true risk directly, we try to
minimize an upper bound (with high probability) on the true risk.

(a) We will discuss how to derive these later.

In other words, we penalize models based on prior information (bias) or information criteria
(MDL, AIC, BIC). In ML there is a "no free lunch" theorem: given only the data, we cannot
learn anything. We need some kind of prior information (inductive bias); for example, in using
linear regression, our inductive bias is that the data can be fit by a line. The inductive bias
in this case, also called Occam's Razor, is to seek the simplest explanation (e.g., if a 10-degree

8.2 FOUR STEPS TO HYPOTHESIS TESTING

The goal of hypothesis testing is to determine the likelihood that a population
parameter, such as the mean, is likely to be true. In this section, we describe the four
steps of hypothesis testing that were briefly introduced in Section 8.1:

Step 1: State the hypotheses.

Step 2: Set the criteria for a decision.

Step 3: Compute the test statistic.

Step 4: Make a decision.

Step 1: State the hypotheses. We begin by stating the value of a population mean
in a null hypothesis, which we presume is true. For the children watching TV
example, we state the null hypothesis that children in the United States watch an
average of 3 hours of TV per week. This is a starting point so that we can decide
whether this is likely to be true, similar to the presumption of innocence in a
courtroom. When a defendant is on trial, the jury starts by assuming that the
defendant is innocent. The basis of the decision is to determine whether this
assumption is true. Likewise, in hypothesis testing, we start by assuming that the
hypothesis or claim we are testing is true. This is stated in the null hypothesis. The
basis of the decision is to determine whether this assumption is likely to be true.

DEFINITION
The null hypothesis (H0), stated as the null, is a statement about a population
parameter, such as the population mean, that is assumed to be true.
The null hypothesis is a starting point. We will test whether the value
stated in the null hypothesis is likely to be true.

Keep in mind that the only reason we are testing the null hypothesis is because
we think it is wrong. We state what we think is wrong about the null hypothesis in
an alternative hypothesis. For the children watching TV example, we may have
reason to believe that children watch more than (>) or less than (<) 3 hours of TV
per week. When we are uncertain of the direction, we can state that the value in the
null hypothesis is not equal to (≠) 3 hours.

NOTE: In hypothesis testing, we conduct a study to test
whether the null hypothesis is likely to be true.

In a courtroom, since the defendant is assumed to be innocent (this is the null
hypothesis so to speak), the burden is on a prosecutor to conduct a trial to show
evidence that the defendant is not innocent. In a similar way, we assume the null
hypothesis is true, placing the burden on the researcher to conduct a study to show
evidence that the null hypothesis is unlikely to be true. Regardless, we always make
a decision about the null hypothesis (that it is likely or unlikely to be true). The
alternative hypothesis is needed for Step 2.

DEFINITION
An alternative hypothesis (H1) is a statement that directly contradicts a null
hypothesis by stating that the actual value of a population parameter is
less than, greater than, or not equal to the value stated in the null hypothesis.
The alternative hypothesis states what we think is wrong about the null
hypothesis, which is needed for Step 2.

To locate the probability of obtaining a sample mean in a sampling distribution,
we must know (1) the population mean and (2) the standard error of the mean
(SEM; introduced in Chapter 7). Each value is entered in the test statistic formula
computed in Step 3, thereby allowing us to make a decision in Step 4. To review,
Table 8.1 displays the notations used to describe populations, samples, and sampling
distributions. Table 8.2 summarizes the characteristics of each type of distribution.

TABLE 8.1 A review of the notation used for the mean, variance, and standard deviation in population,
sample, and sampling distributions.

Characteristic        Population    Sample              Sampling Distribution
Mean                  $\mu$         $M$ or $\bar{X}$    $\mu_M = \mu$
Variance              $\sigma^2$    $s^2$ or $SD^2$     $\sigma_M^2 = \sigma^2 / n$
Standard deviation    $\sigma$      $s$ or $SD$         $\sigma_M = \sigma / \sqrt{n}$

TABLE 8.2 A review of the key differences between population, sample, and sampling distributions.

What is it?
    Population Distribution: Scores of all persons in a population
    Sample Distribution: Scores of a select portion of persons from the population
    Distribution of Sample Means: All possible sample means that can be drawn, given a certain sample size
Is it accessible?
    Population Distribution: Typically, no
    Sample Distribution: Yes
    Distribution of Sample Means: Yes
What is the shape?
    Population Distribution: Could be any shape
    Sample Distribution: Could be any shape
    Distribution of Sample Means: Normally distributed

LEARNING CHECK 3

1. For the following statement, write increases or decreases as an answer. The
likelihood that we reject the null hypothesis (increases or decreases):
a. The closer the value of a sample mean is to the value stated by the null
hypothesis?
b. The further the value of a sample mean is from the value stated in the null
hypothesis?
2. A researcher selects a sample of 49 students to test the null hypothesis that the
average student exercises 90 minutes per week. What is the mean for the sampling
distribution for this population of interest if the null hypothesis is true?

Answers: 1. (a) Decreases, (b) Increases; 2. 90 minutes.
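To make the role of the population mean and the standard error concrete, here is a minimal sketch of a two-tailed z test. It assumes the usual form of the statistic, $z = (M - \mu) / \sigma_M$ with $\sigma_M = \sigma / \sqrt{n}$, and reuses the exercise example from Learning Check 3 (n = 49, null value 90 minutes). The population standard deviation of 14 and the sample mean of 95 are hypothetical values chosen only for illustration.

```python
import math
from statistics import NormalDist


def standard_error(sigma, n):
    """Standard error of the mean: sigma_M = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed z test of the null hypothesis that the population mean is mu0.

    Returns (z, p value, decision). The decision rule follows the text:
    reject when p is at or below the level of significance.
    """
    sem = standard_error(sigma, n)
    z = (sample_mean - mu0) / sem           # distance in standard errors
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # probability in both tails
    decision = "reject" if p <= alpha else "retain"
    return z, p, decision


# Hypothetical numbers: sigma = 14 minutes, observed sample mean = 95 minutes
z, p, decision = z_test(sample_mean=95, mu0=90, sigma=14, n=49)
```

Here the standard error is 14 / 7 = 2 minutes, so a sample mean of 95 lies 2.5 standard errors above the null value, which falls in the tails beyond the 5% criterion.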

polynomial and 100-degree polynomial say roughly the same things, pick the former). C HA PTE R 8 : IN TRODUC TIO N TO HY POTHES IS TES TI NG 5 10 PART III: PROBABILITY AND THE FOUNDATIONS OF INFERENTIAL STATISTICS

Inductive bias can also come from domain knowledge. For example, the function of oil spill
contamination should be smooth (if one point is contaminated, the points around it should be 8.4 MAKING A DECISION: TYPES OF ERROR
as well), while the function of photon arrival is not. Therefore, even if we get the same data,
MAKING SENSE: Testing the Null Hypothesis
the fit functions may look very different. In Step 4, we decide whether to retain or reject the null hypothesis. Because we are
A decision made in hypothesis testing centers on the null hypothesis. This observing a sample and not an entire population, it is possible that a conclusion
means two things in terms of making a decision: may be wrong. Table 8.3 shows that there are four decision alternatives regarding
the truth and falsity of the decision we make about a null hypothesis:
1. Decisions are made about the null hypothesis. Using the courtroom
analogy, a jury decides whether a defendant is guilty or not guilty. The
1. The decision to retain the null hypothesis could be correct.
jury does not make a decision of guilty or innocent because the defendant
is assumed to be innocent. All evidence presented in a trial is to show 2. The decision to retain the null hypothesis could be incorrect.
that a defendant is guilty. The evidence either shows guilt (decision:
3. The decision to reject the null hypothesis could be correct.
guilty) or does not (decision: not guilty). In a similar way, the null
hypothesis is assumed to be correct. A researcher conducts a study show- 4. The decision to reject the null hypothesis could be incorrect.
ing evidence that this assumption is unlikely (we reject the null hypoth-
esis) or fails to do so (we retain the null hypothesis).
TABLE 8.3 Four outcomes for making a decision. The decision can be either correct (correctly reject
2. The bias is to do nothing. Using the courtroom analogy, for the same or retain null) or wrong (incorrectly reject or retain null).
reason the courts would rather let the guilty go free than send the inno-
cent to prison, researchers would rather do nothing (accept previous
notions of truth stated by a null hypothesis) than make statements that Decision
are not correct. For this reason, we assume the null hypothesis is correct,
thereby placing the burden on the researcher to demonstrate that the Retain the null Reject the null
null hypothesis is not likely to be correct.
CORRECT TYPE I ERROR
True
1–α α
Truth in the
Step 2: Set the criteria for a decision. To set the criteria for a decision, we state the TYPE II ERROR CORRECT
population
level of significance for a test. This is similar to the criterion that jurors use in a False β 1–β
criminal trial. Jurors decide whether the evidence presented shows guilt beyond a POWER
An example of penalizing complex models using prior knowledge is regularized linear regres- reasonable doubt (this is the criterion). Likewise, in hypothesis testing, we collect
sion, which uses some norm of regression coefficients as the cost C(f ). An example of penal- data to show that the null hypothesis is not true, based on the likelihood of selecting
a sample mean from a population (the likelihood is the criterion). The likelihood or
izing models based on information content is AIC (C(f ) = # parameters) or BIC (C(f ) =
level of significance is typically set at 5% in behavioral research studies. When the We investigate each decision alternative in this section. Since we will observe a
# parameters × log n). AIC allows # parameters to be infinite as # of training data n becomes probability of obtaining a sample mean is less than 5% if the null hypothesis were sample, and not a population, it is impossible to know for sure the truth in the
large, while BIC penalizes complex models more heavily. true, then we conclude that the sample we selected is too unlikely and so we reject population. So for the sake of illustration, we will assume we know this. This
the null hypothesis. assumption is labeled as truth in the population in Table 8.3. In this section, we will
introduce each decision alternative.
12.2 Model Selection Level of significance, or significance level, refers to a criterion of judgment
upon which a decision is made regarding the value stated in a null hypothesis. DEFINITION
DECISION: RETAIN THE NULL HYPOTHESIS
The criterion is based on the probability of obtaining a statistic measured in a
sample if the value stated in the null hypothesis were true. When we decide to retain the null hypothesis, we can be correct or incorrect. The
In behavioral science, the criterion or level of significance is typically set at correct decision is to retain a true null hypothesis. This decision is called a null
5%. When the probability of obtaining a sample mean is less than 5% if the
result or null finding. This is usually an uninteresting decision because the deci-
null hypothesis were true, then we reject the value stated in the null
sion is to retain what we already assumed: that the value stated in the null hypoth-
hypothesis.
esis is correct. For this reason, null results alone are rarely published in behavioral
research.
The alternative hypothesis establishes where to place the level of significance. Remember that we know that the sample mean will equal the population mean on average if the null hypothesis is true. All other possible values of the sample mean are normally distributed (central limit theorem). The empirical rule tells us that at least 95% of all sample means fall within about 2 standard deviations (SD) of the population mean, meaning that there is less than a 5% probability of obtaining a

The incorrect decision is to retain a false null hypothesis. This decision is an example of a Type II error, or β error. With each test we make, there is always some probability that the decision could be a Type II error. In this decision, we decide to retain previous notions of truth that are in fact false. While it’s an error, we still did nothing; we retained the null hypothesis. We can always go back and conduct more studies.
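To make the Type II error concrete, the sketch below computes β and power for an upper-tail z test. It borrows the GRE figures used in this chapter's examples (µ = 558, σ = 139, n = 100, α = .05) and assumes, purely for illustration, that the true population mean is 585. That assumed true mean, the function name `upper_tail_power`, and its parameters are not from the text.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Return (power, beta) for an upper-tail z test of H0: mu = mu0.

    mu_true is an ASSUMED true population mean; in practice it is unknown.
    """
    standard_error = sigma / sqrt(n)
    # Critical z for placing all of alpha in the upper tail.
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Smallest sample mean that would lead us to reject the null.
    mean_cutoff = mu0 + z_crit * standard_error
    # Beta: probability the sample mean stays below the cutoff
    # even though the true mean is mu_true (a Type II error).
    beta = NormalDist(mu_true, standard_error).cdf(mean_cutoff)
    return 1 - beta, beta


power, beta = upper_tail_power(mu0=558, mu_true=585, sigma=139, n=100)
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

Under this assumed true mean, β is roughly .38 and power roughly .62. When the true mean equals the null value, the same function returns a rejection probability equal to α, which is the Type I error rate.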

CHAPTER 8: INTRODUCTION TO HYPOTHESIS TESTING
PART III: PROBABILITY AND THE FOUNDATIONS OF INFERENTIAL STATISTICS

DEFINITION: Type II error, or beta (β) error, is the probability of retaining a null hypothesis that is actually false.

NOTE: A Type II error, or beta (β) error, is the probability of incorrectly retaining the null hypothesis.

DECISION: REJECT THE NULL HYPOTHESIS

When we decide to reject the null hypothesis, we can be correct or incorrect. The incorrect decision is to reject a true null hypothesis. This decision is an example of a Type I error. With each test we make, there is always some probability that our decision is a Type I error. A researcher who makes this error decides to reject previous notions of truth that are in fact true. Making this type of error is analogous to finding an innocent person guilty. To minimize this error, we assume a defendant is innocent when beginning a trial. Similarly, to minimize making a Type I error, we assume the null hypothesis is true when beginning a hypothesis test.

DEFINITION: Type I error is the probability of rejecting a null hypothesis that is actually true. Researchers directly control for the probability of committing this type of error.
An alpha (α) level is the level of significance or criterion for a hypothesis test. It is the largest probability of committing a Type I error that we will allow and still decide to reject the null hypothesis.

Since we assume the null hypothesis is true, we control for Type I error by stating a level of significance. The level we set, called the alpha level (symbolized as α), is the largest probability of committing a Type I error that we will allow and still decide to reject the null hypothesis. This criterion is usually set at .05 (α = .05), and we compare the alpha level to the p value. When the probability of a Type I error is less than 5% (p < .05), we decide to reject the null hypothesis; otherwise, we retain the null hypothesis.

NOTE: Researchers directly control for the probability of a Type I error by stating an alpha (α) level.

The correct decision is to reject a false null hypothesis. There is always some probability that we decide that the null hypothesis is false when it is indeed false. This decision is called the power of the decision-making process. It is called power because it is the decision we aim for. Remember that we are only testing the null hypothesis because we think it is wrong. Deciding to reject a false null hypothesis, then, is the power, inasmuch as we learn the most about populations when we accurately reject false notions of truth. This decision is the most published result in behavioral research.

NOTE: The power in hypothesis testing is the probability of correctly rejecting the value stated in the null hypothesis.

DEFINITION: The power in hypothesis testing is the probability of rejecting a false null hypothesis. Specifically, it is the probability that a randomly selected sample will show that the null hypothesis is false when the null hypothesis is indeed false.

LEARNING CHECK 4
1. What type of error do we directly control?
2. What type of error is associated with decisions to retain the null?
3. What type of error is associated with decisions to reject the null?
4. State the two correct decisions that a researcher can make.
Answers: 1. Type I error; 2. Type II error; 3. Type I error; 4. Retain a true null hypothesis and reject a false null hypothesis.

DIRECTIONAL, UPPER-TAIL CRITICAL HYPOTHESIS TESTS (H1: >)

In Example 8.2, we will use the z test for a directional, or one-tailed test, where the alternative hypothesis is stated as greater than (>) the null hypothesis. A directional test can also be stated as less than (<) the null hypothesis (an example for this alternative is given in Example 8.3). For an upper-tail critical test, or a greater than statement, we place the level of significance in the upper tail of the sampling distribution. So we are interested in any alternative greater than the value stated in the null hypothesis. This test is appropriate when it is not possible or highly unlikely that a sample mean will fall below the population mean stated in the null hypothesis.

NOTE: An upper-tail critical test is conducted when it is not possible or highly unlikely that a sample mean will fall below the population mean stated in the null hypothesis.

DEFINITION: Directional tests, or one-tailed tests, are hypothesis tests where the alternative hypothesis is stated as greater than (>) or less than (<) a value stated in the null hypothesis. Hence, the researcher is interested in a specific alternative from the null hypothesis.

EXAMPLE 8.2
Using the same study from Example 8.1, Templer and Tomeo (2002) reported that the population mean on the quantitative portion of the GRE General Test for students taking the exam between 1994 and 1997 was 558 ± 139 (µ ± σ). Suppose we select a sample of 100 students enrolled in an elite private school (n = 100). We hypothesize that students at this elite school will score higher than the general population. We record a sample mean equal to 585 (M = 585), same as measured in Example 8.1. Compute the one–independent sample z test at a .05 level of significance.

Step 1: State the hypotheses. The population mean is 558, and we are testing whether the alternative is greater than (>) this value:
H0: µ = 558 Mean test scores are equal to 558 in the population of students at the elite school.
H1: µ > 558 Mean test scores are greater than 558 in the population of students at the elite school.

Step 2: Set the criteria for a decision. The level of significance is .05, which makes the alpha level α = .05. To determine the critical value for an upper-tail critical test, we locate the probability .0500 toward the tail in column C in the unit normal table. The z-score associated with this probability is between z = 1.64 and z = 1.65. The average of these z-scores is z = 1.645. This is the critical value or cutoff for the rejection region. Figure 8.6 shows that for this test, we place all the value of alpha in the upper tail of the standard normal distribution.

NOTE: For one-tailed tests, the alpha level is placed in a single tail of a distribution. For upper-tail critical tests, the alpha level is placed above the mean in the upper tail.

Step 3: Compute the test statistic. Step 2 sets the stage for making a decision because the criterion is set. The probability is less than 5% that we will obtain a sample mean that is at least 1.645 standard deviations above the value of the population mean stated in the null hypothesis. In this step, we will compute a test statistic to determine whether or not the sample mean we selected is beyond the critical value we stated in Step 2.

The two-tailed test is more conservative; it makes it more difficult to reject the null hypothesis. It also eliminates the possibility of committing a Type III error. The one-tailed test, though, is associated with greater power. If the value stated in the null hypothesis is false, then a one-tailed test will make it easier to detect this (i.e., lead to a decision to reject the null hypothesis). Because the one-tailed test makes it easier to reject the null hypothesis, it is important that we justify that an outcome can occur in only one direction. Justifying that an outcome can occur in only one direction is difficult for much of the data that behavioral researchers measure. For this reason, most studies in behavioral research are two-tailed tests.

NOTE: Two-tailed tests are more conservative and eliminate the possibility of committing a Type III error. One-tailed tests are associated with more power, assuming the value stated in the null hypothesis is wrong.

LEARNING CHECK 5
1. Is the following set of hypotheses appropriate for a directional or a nondirectional hypothesis test?
H0: µ = 35
H1: µ ≠ 35
2. A researcher conducts a one–independent sample z test. The z statistic for the upper-tail critical test at a .05 level of significance was Zobt = 1.84. What is the decision for this test?
3. A researcher conducts a hypothesis test and finds that the probability of selecting the sample mean is p = .0689 if the value stated in the null hypothesis is true. What is the decision for a hypothesis test at a .05 level of significance?
4. Which type of test, one-tailed or two-tailed, is associated with greater power to detect an effect when the null hypothesis is false?
Answers: 1. A nondirectional (two-tailed) hypothesis test; 2. Reject the null; 3. Retain the null; 4. One-tailed tests.

8.7 MEASURING THE SIZE OF AN EFFECT: COHEN’S d

A decision to reject the null hypothesis means that an effect is significant. For a one-sample test, an effect is the difference between a sample mean and the population mean stated in the null hypothesis. In Example 8.2, we found a significant effect, meaning that the sample mean, M = 585, was significantly larger than the value stated in the null hypothesis, µ = 558. Hypothesis testing identifies whether an effect exists in a population. When a sample mean is likely to occur if the null hypothesis were true (p > .05), we decide that an effect doesn’t exist in a population; the effect is insignificant. When a sample mean is unlikely to occur if the null hypothesis were true (p < .05), we decide that an effect does exist in a population; the effect is significant. Hypothesis testing does not, however, inform us of how big the effect is.

To determine the size of an effect, we compute effect size. There are two ways to calculate the size of an effect. We can determine:
1. How far scores shifted in the population
2. The percent of variance that can be explained by a given variable
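The critical values quoted for the upper-tail test in Example 8.2 (1.645) and for a two-tailed test (±1.96) can be checked against the inverse of the standard normal CDF instead of the unit normal table. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper names `critical_value` and `decide_upper_tail` are hypothetical and not part of the text.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def critical_value(alpha, tails):
    """Return the positive critical z for a one- or two-tailed test."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits alpha in half and places it in each tail.
    tail_area = alpha / tails
    return STANDARD_NORMAL.inv_cdf(1 - tail_area)


def decide_upper_tail(z_obt, alpha=0.05):
    """Reject the null only if z_obt exceeds the upper-tail critical value."""
    return "reject" if z_obt > critical_value(alpha, tails=1) else "retain"


print(round(critical_value(0.05, tails=1), 3))  # one-tailed cutoff
print(round(critical_value(0.05, tails=2), 2))  # two-tailed cutoff
print(decide_upper_tail(1.84))                  # Learning Check 5, item 2
```

At the same α, the two-tailed cutoff sits farther from the mean than the one-tailed cutoff, which is the sense in which the two-tailed test is more conservative.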


8.5 TESTING A RESEARCH HYPOTHESIS: EXAMPLES USING THE Z TEST

The test statistic in Step 3 converts the sampling distribution we observe into a standard normal distribution, thereby allowing us to make a decision in Step 4. The test statistic we use depends largely on what we know about the population. When we know the mean and standard deviation in a single population, we can use the one–independent sample z test, which we will use in this section to illustrate the four steps of hypothesis testing.

DEFINITION: The one–independent sample z test is a statistical procedure used to test hypotheses concerning the mean in a single population with a known variance.

Recall that we can state one of three alternative hypotheses: A population mean is greater than (>), less than (<), or not equal (≠) to the value stated in a null hypothesis. The alternative hypothesis determines in which tail of a sampling distribution to place the level of significance, as illustrated in Figure 8.2. In this section, we will use an example for each type of alternative hypothesis.

NOTE: The z test is used to test hypotheses about a population mean when the population variance is known.

NONDIRECTIONAL, TWO-TAILED HYPOTHESIS TESTS (H1: ≠)

In Example 8.1, we will use the z test for a nondirectional, or two-tailed test, where the alternative hypothesis is stated as not equal to (≠) the null hypothesis. For this test, we will place the level of significance in both tails of the sampling distribution. We are therefore interested in any alternative from the null hypothesis. This is the most common alternative hypothesis tested in behavioral science.

NOTE: Nondirectional tests are used to test hypotheses when we are interested in any alternative from the null hypothesis.

DEFINITION: Nondirectional tests, or two-tailed tests, are hypothesis tests where the alternative hypothesis is stated as not equal to (≠). The researcher is interested in any alternative from the null hypothesis.

EXAMPLE 8.1
Templer and Tomeo (2002) reported that the population mean score on the quantitative portion of the Graduate Record Examination (GRE) General Test for students taking the exam between 1994 and 1997 was 558 ± 139 (µ ± σ). Suppose we select a sample of 100 participants (n = 100). We record a sample mean equal to 585 (M = 585). Compute the one–independent sample z test for whether or not we will retain the null hypothesis (µ = 558) at a .05 level of significance (α = .05).

Step 1: State the hypotheses. The population mean is 558, and we are testing whether the null hypothesis is (=) or is not (≠) correct:
H0: µ = 558 Mean test scores are equal to 558 in the population.
H1: µ ≠ 558 Mean test scores are not equal to 558 in the population.

Step 2: Set the criteria for a decision. The level of significance is .05, which makes the alpha level α = .05. To locate the probability of obtaining a sample mean from a given

FIGURE 8.6 The critical value (1.645) for a directional (upper-tail critical) hypothesis test at a .05 level of significance. When the test statistic exceeds 1.645, we reject the null hypothesis; otherwise, we retain the null hypothesis. [Figure: standard normal curve from −3 to 3 with the rejection region (α = .05) in the upper tail beyond z = 1.645.]

The test statistic does not change from that in Example 8.1. We are testing the same population, and we measured the same value of the sample mean. We changed only the location of the rejection region in Step 2. The z statistic is the same computation as that shown in Example 8.1:

\[ z_{obt} = \frac{M - \mu}{\sigma_M} = \frac{585 - 558}{13.9} = 1.94. \]

Step 4: Make a decision. To make a decision, we compare the obtained value to the critical value. We reject the null hypothesis if the obtained value exceeds the critical value. Figure 8.7 shows that the obtained value (Zobt = 1.94) is greater than the critical value; it falls in the rejection region. The decision is to reject the null hypothesis. The p value for this test is .0262 (p = .0262). We do not double the p value for one-tailed tests.

FIGURE 8.7 Since the obtained value reaches the rejection region, we decide to reject the null hypothesis. [Figure: standard normal curve from −3 to 3 with the rejection region (α = .05) in the upper tail; the obtained value 1.94 falls in the rejection region.]

We found in Example 8.2 that if the null hypothesis were true, then p = .0262 that we could have selected this sample mean from this population. The criterion we set in Step 2 was that the probability must be less than 5% that we obtain a sample mean, if the null hypothesis were true. Since p is less than 5%, we decide to reject the null hypothesis. We decide that the mean score on the GRE General Test in this

DEFINITION: For a single sample, an effect is the difference between a sample mean and the population mean stated in the null hypothesis. In hypothesis testing, an effect is insignificant when we retain the null hypothesis; an effect is significant when we reject the null hypothesis.
Effect size is a statistical measure of the size of an effect in a population, which allows researchers to describe how far scores shifted in the population, or the percent of variance that can be explained by a given variable.

Effect size is most meaningfully reported with significant effects when the decision was to reject the null hypothesis. If an effect is not significant, as in instances when we retain the null hypothesis, then we are concluding that an effect does not exist in a population. It makes little sense to compute the size of an effect that we just concluded doesn’t exist. In this section, we describe how far scores shifted in the population using a measure of effect size called Cohen’s d.

NOTE: Cohen’s d is a measure of the number of standard deviations an effect is shifted above or below the population mean stated by the null hypothesis.

Cohen’s d measures the number of standard deviations an effect shifted above or below the population mean stated by the null hypothesis. The formula for Cohen’s d replaces the standard error in the denominator of the test statistic with the population standard deviation (Cohen, 1988):

\[ \text{Cohen's } d = \frac{M - \mu}{\sigma}. \]

The value of Cohen’s d is zero when there is no difference between two means and increases as the differences get larger. To interpret values of d, we refer to Cohen’s effect size conventions outlined in Table 8.6. The sign of d indicates the direction of the shift. When values of d are positive, an effect shifted above the population mean; when values of d are negative, an effect shifted below the population mean.

NOTE: Hypothesis testing determines whether an effect exists in a population. Effect size measures the size of an observed effect from small to large.

TABLE 8.6 Cohen’s effect size conventions.

Description of Effect    Effect Size (d)
Small                    d < 0.2
Medium                   0.2 < d < 0.8
Large                    d > 0.8

DEFINITION: Cohen’s d is a measure of effect size in terms of the number of standard deviations that mean scores shifted above or below the population mean stated by the null hypothesis. The larger the value of d, the larger the effect in the population.
Cohen’s effect size conventions are standard rules for identifying small, medium, and large effects based on typical findings in behavioral research.

In Example 8.4, we will compute effect size for the research study in Examples 8.1 to 8.3. Since we tested the same population and measured the same sample mean in each example, the effect size estimate will be the same for all examples.
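The z test in Examples 8.1 to 8.3 uses the same data (M = 585, µ = 558, σ = 139, n = 100) and changes only where the rejection region is placed. The sketch below runs all three alternatives side by side; the function names `z_statistic` and `p_value` and the alternative labels are illustrative assumptions. The text's p = .0262 comes from looking up the rounded z = 1.94 in the unit normal table, so the unrounded calculation here gives a slightly smaller value (about .026).

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, mu0, sigma, n):
    """One-independent sample z statistic: (M - mu) / (sigma / sqrt(n))."""
    standard_error = sigma / sqrt(n)
    return (sample_mean - mu0) / standard_error


def p_value(z_obt, alternative):
    """p value for H1 stated as 'greater', 'less', or 'not equal'."""
    upper_area = 1 - NormalDist().cdf(z_obt)
    if alternative == "greater":
        return upper_area
    if alternative == "less":
        return 1 - upper_area
    if alternative == "not equal":
        # Two-tailed: double the smaller tail area.
        return 2 * min(upper_area, 1 - upper_area)
    raise ValueError("unknown alternative")


z_obt = z_statistic(sample_mean=585, mu0=558, sigma=139, n=100)
for alternative in ("not equal", "greater", "less"):
    p = p_value(z_obt, alternative)
    decision = "reject" if p < 0.05 else "retain"
    print(f"H1 {alternative}: z = {z_obt:.2f}, p = {p:.4f}, {decision} the null")
```

Only the upper-tail test rejects the null hypothesis at α = .05; the two-tailed and lower-tail tests retain it, which matches the decisions reached in the three examples.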


population, we use the standard normal distribution. We will locate the z scores in a standard normal distribution that are the cutoffs, or critical values, for sample mean values with less than a 5% probability of occurrence if the value stated in the null (µ = 558) is true.

DEFINITION: A critical value is a cutoff value that defines the boundaries beyond which less than 5% of sample means can be obtained if the null hypothesis is true. Sample means obtained beyond a critical value will result in a decision to reject the null hypothesis.

In a nondirectional two-tailed test, we divide the alpha value in half so that an equal proportion of area is placed in the upper and lower tail. Table 8.4 gives the critical values for one- and two-tailed tests at a .05, .01, and .001 level of significance. Figure 8.4 displays a graph with the critical values for Example 8.1 shown. In this example α = .05, so we split this probability in half:

\[ \text{Splitting } \alpha \text{ in half: } \frac{\alpha}{2} = \frac{.05}{2} = .0250 \text{ in each tail} \]

TABLE 8.4 Critical values for one- and two-tailed tests at three commonly used levels of significance.

                             Type of Test
Level of Significance (α)    One-Tailed           Two-Tailed
0.05                         +1.645 or −1.645     ±1.96
0.01                         +2.33 or −2.33       ±2.58
0.001                        +3.09 or −3.09       ±3.30

FIGURE 8.4 The critical values (±1.96) for a nondirectional (two-tailed) test with a .05 level of significance. [Figure: standard normal curve from −3 to 3 with rejection regions (α = .0250 each) beyond −1.96 and 1.96.]

population is not 558, which was the value stated in the null hypothesis. Also, notice that we made two different decisions using the same data in Examples 8.1 and 8.2. This outcome is explained further in Section 8.6.

DIRECTIONAL, LOWER-TAIL CRITICAL HYPOTHESIS TESTS (H1: <)

In Example 8.3, we will use the z test for a directional, or one-tailed test, where the alternative hypothesis is stated as less than (<) the null hypothesis. For a lower-tail critical test, or a less than statement, we place the level of significance or critical value in the lower tail of the sampling distribution. So we are interested in any alternative less than the value stated in the null hypothesis. This test is appropriate when it is not possible or highly unlikely that a sample mean will fall above the population mean stated in the null hypothesis.

NOTE: A lower-tail critical test is conducted when it is not possible or highly unlikely that a sample mean will fall above the population mean stated in the null hypothesis.

EXAMPLE 8.3
Using the same study from Example 8.1, Templer and Tomeo (2002) reported that the population mean on the quantitative portion of the GRE General Test for those taking the exam between 1994 and 1997 was 558 ± 139 (µ ± σ). Suppose we select a sample of 100 students enrolled in a school with low funding and resources (n = 100). We hypothesize that students at this school will score lower than the general population. We record a sample mean equal to 585 (M = 585), same as measured in Examples 8.1 and 8.2. Compute the one–independent sample z test at a .05 level of significance.

Step 1: State the hypotheses. The population mean is 558, and we are testing whether the alternative is less than (<) this value:
H0: µ = 558 Mean test scores are equal to 558 in the population at this school.
H1: µ < 558 Mean test scores are less than 558 in the population at this school.

Step 2: Set the criteria for a decision. The level of significance is .05, which makes the alpha level α = .05. To determine the critical value for a lower-tail critical test, we locate the probability .0500 toward the tail in column C in the unit normal table. The z-score associated with this probability is again z = 1.645. Since this test is a lower-tail critical test, we place the critical value the same distance below the mean: The critical value for this test is z = −1.645. All of the alpha level is placed in the lower tail of the distribution beyond the critical value. Figure 8.8 shows the standard normal distribution, with the rejection region beyond the critical value.

NOTE: For one-tailed tests, the alpha level is placed in a single tail of the distribution. For lower-tail critical tests, the alpha is placed below the mean in the lower tail.

Step 3: Compute the test statistic. Step 2 sets the stage for making a decision because the criterion is set. The probability is less than 5% that we will obtain a sample mean that is at least 1.645 standard deviations below the value of the population mean stated in the null hypothesis. In this step, we will compute a test statistic to determine whether or not the sample mean we selected is beyond the critical value we stated in Step 2.

The test statistic does not change from that used in Example 8.1. We are testing the same population, and we measured the same value of the sample mean. We changed

EXAMPLE 8.4
In Examples 8.1 to 8.3, we used data given by Templer and Tomeo (2002). They reported that the population mean on the quantitative portion of the GRE General Test for those taking the exam between 1994 and 1997 was 558 ± 139 (µ ± σ). In each example, the mean test score in the sample was 585 (M = 585). What is the effect size for this test using Cohen’s d?

The numerator for Cohen’s d is the difference between the sample mean (M = 585) and the population mean (µ = 558). The denominator is the population standard deviation (σ = 139):

\[ d = \frac{M - \mu}{\sigma} = \frac{27}{139} = 0.19. \]

We conclude that the observed effect shifted 0.19 standard deviations above the mean in the population. This way of interpreting effect size is illustrated in Figure 8.11. We are stating that students in the elite school scored 0.19 standard deviations higher, on average, than students in the general population. This interpretation is most meaningfully reported with Example 8.2 since we decided to reject the null hypothesis using this example. Table 8.7 compares the basic characteristics of hypothesis testing and effect size.

FIGURE 8.11 Effect size. Cohen’s d estimates the size of an effect using the population standard deviation as an absolute comparison. A 27-point effect shifted the distribution of scores in the population by 0.19 standard deviations. [Figure: two population distributions, one assuming the null is true (µ = 558, σ = 139) and one assuming the null is false with a 27-point effect (µ = 585, σ = 139); d = 0.19.]
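The effect size calculation in Example 8.4 can be reproduced in a few lines. This sketch computes Cohen's d from the sample mean, the null-hypothesis mean, and the population standard deviation, then labels it with Cohen's conventions (small below 0.2, medium between 0.2 and 0.8, large above 0.8). The function names are hypothetical, and how to treat a value of exactly 0.2 or 0.8 is an assumption here, since the conventions use strict inequalities.

```python
def cohens_d(sample_mean, mu0, sigma):
    """Cohen's d: shift of the sample mean from mu0 in population SD units."""
    return (sample_mean - mu0) / sigma


def describe_effect(d):
    """Label |d| with Cohen's conventions.

    ASSUMPTION: exactly 0.2 is treated as medium and exactly 0.8 as large.
    """
    size = abs(d)  # the sign only gives the direction of the shift
    if size < 0.2:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


d_gre = cohens_d(sample_mean=585, mu0=558, sigma=139)   # Example 8.4
print(round(d_gre, 2), describe_effect(d_gre))

d_check = cohens_d(sample_mean=23, mu0=25, sigma=6)     # Learning Check 6
print(round(d_check, 2), describe_effect(d_check))
```

Unlike the z statistic, d does not depend on the sample size n, because the denominator is the population standard deviation and not the standard error.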


NOTE: For two-tailed tests, the alpha is split in half and placed in each tail of a standard normal distribution.

To locate the critical values, we use the unit normal table given in Table B1 in Appendix B and look up the proportion .0250 toward the tail in column C. This value, .0250, is listed for a z-score equal to z = 1.96. This is the critical value for the upper tail of the standard normal distribution. Since the normal distribution is symmetrical, the critical value in the bottom tail will be the same distance below the mean, or z = −1.96. The regions beyond the critical values, displayed in Figure 8.4, are called the rejection regions. If the value of the test statistic falls in these regions, then the decision is to reject the null hypothesis; otherwise, we retain the null hypothesis.

NOTE: A critical value marks the cutoff for the rejection region.

DEFINITION: The rejection region is the region beyond a critical value in a hypothesis test. When the value of a test statistic is in the rejection region, we decide to reject the null hypothesis; otherwise, we retain the null hypothesis.

Step 3: Compute the test statistic. Step 2 sets the stage for making a decision because the criterion is set. The probability is less than 5% that we will obtain a sample mean that is at least 1.96 standard deviations above or below the value of the population mean stated in the null hypothesis. In this step, we will compute a test statistic to determine whether the sample mean we selected is beyond or within the critical values we stated in Step 2.

The test statistic for a one–independent sample z test is called the z statistic. The z statistic converts any sampling distribution into a standard normal distribution. The z statistic is therefore a z transformation. The solution of the formula gives the number of standard deviations, or z-scores, that a sample mean falls above or below the population mean stated in the null hypothesis. We can then compare the value of the z statistic, called the obtained value, to the critical values we determined in Step 2. The z statistic formula is the sample mean minus the population mean stated in the null hypothesis, divided by the standard error of the mean:

\[ z \text{ statistic: } z_{obt} = \frac{M - \mu}{\sigma_M}, \quad \text{where } \sigma_M = \frac{\sigma}{\sqrt{n}}. \]

DEFINITION: The z statistic is an inferential statistic used to determine the number of standard deviations in a standard normal distribution that a sample mean deviates from the population mean stated in the null hypothesis.
The obtained value is the value of a test statistic. This value is compared to the critical value(s) of a hypothesis test to make a decision. When the obtained value exceeds a critical value, we decide to reject the null hypothesis; otherwise, we retain the null hypothesis.

To calculate the z statistic, first compute the standard error (σM), which is the denominator for the z statistic:

\[ \sigma_M = \frac{\sigma}{\sqrt{n}} = \frac{139}{\sqrt{100}} = 13.9. \]

NOTE: The z statistic measures the number of standard deviations, or z-scores, that a sample mean falls above or below the population mean stated in the null hypothesis.

Then compute the z statistic by substituting the values of the sample mean, M = 585; the population mean stated by the null hypothesis, µ = 558; and the standard error we just calculated, σM = 13.9:

\[ z_{obt} = \frac{M - \mu}{\sigma_M} = \frac{585 - 558}{13.9} = 1.94. \]

FIGURE 8.8 The critical value (−1.645) for a directional (lower-tail critical) test at a .05 level of significance. When the test statistic is less than −1.645, we reject the null hypothesis; otherwise, we retain the null hypothesis. [Figure: standard normal curve from −3 to 3 with the rejection region (α = .05) in the lower tail beyond z = −1.645.]

only the location of the rejection region in Step 2. The z statistic is the same computation as that shown in Example 8.1:

\[ z_{obt} = \frac{M - \mu}{\sigma_M} = \frac{585 - 558}{13.9} = 1.94. \]

Step 4: Make a decision. To make a decision, we compare the obtained value to the critical value. We reject the null hypothesis if the obtained value exceeds the critical value. Figure 8.9 shows that the obtained value (Zobt = +1.94) does not exceed the critical value. Instead, the value we obtained is located in the opposite tail. The decision is to retain the null hypothesis.

FIGURE 8.9 Since the obtained value does not reach the rejection region, we decide to retain the null hypothesis. [Figure: standard normal curve from −3 to 3 with the rejection region (α = .05) in the lower tail and the obtained value 1.94 in the upper tail. Annotation: The test statistic does not reach the rejection region; retain the null hypothesis. This is actually a Type III error; this result would have been significant if the rejection region were placed in the upper tail.]

TABLE 8.7 Distinguishing characteristics for significance testing and effect size.

Value being measured?
  Hypothesis (Significance) Testing: p value
  Effect Size (Cohen’s d): d
What type of distribution is the test based upon?
  Hypothesis (Significance) Testing: Sampling distribution
  Effect Size (Cohen’s d): Population distribution
What does the test measure?
  Hypothesis (Significance) Testing: The probability of obtaining a measured sample mean
  Effect Size (Cohen’s d): The size of a measured treatment effect in the population
What can be inferred from the test?
  Hypothesis (Significance) Testing: Whether the null hypothesis is true or false
  Effect Size (Cohen’s d): Whether the size of a treatment effect is small to large
Can this test stand alone in research reports?
  Hypothesis (Significance) Testing: Yes, the test statistic can be reported without an effect size
  Effect Size (Cohen’s d): No, effect size is almost always reported with a test statistic

LEARNING CHECK 6
1. ________ measures the size of an effect in a population, whereas ______________ measures whether an effect exists in a population.
2. The scores for a population are normally distributed with a mean equal to 25 and standard deviation equal to 6. A researcher selects a sample of 36 students and measures a sample mean equal to 23 (M = 23). For this example:
a. What is the value of Cohen’s d?
b. Is this effect size small, medium, or large?
Answers: 1. Effect size, hypothesis or significance testing; 2. (a) d = (23 − 25)/6 = −0.33, (b) Medium effect size.

8.8 EFFECT SIZE, POWER, AND SAMPLE SIZE

One advantage of knowing effect size, d, is that its value can be used to determine the power of detecting an effect in hypothesis testing. The likelihood of detecting an effect, called power, is critical in behavioral research because it lets the researcher know the probability that a randomly selected sample will lead to a decision to reject the null hypothesis, if the null hypothesis is false. In this section, we describe how effect size and sample size are related to power.

THE RELATIONSHIP BETWEEN EFFECT SIZE AND POWER

As effect size increases, power increases. To illustrate, we will use a random sample of quiz scores in two statistics classes shown in Table 8.8. Notice that only the


standard deviation differs between these populations. Using the values given in Table 8.8, we already have enough information to compute effect size:

TABLE 8.8 Characteristics for two hypothetical populations of quiz scores.

The decision in Example 8.3 was to retain the null hypothesis, although if we placed the rejection region in the upper tail (as we did in Example 8.2), we would have decided to reject the null hypothesis. We anticipated that scores would be worse, and instead, they were better than the value stated in the null hypothesis. When we fail to reject the null hypothesis because we placed the rejection region in the wrong tail, we commit a Type III error (Kaiser, 1960).

NOTE: A Type III error occurs when the rejection region is located in the wrong tail. This type of error is only possible for one-tailed tests.

Step 4: Make a decision. To make a decision, we compare the obtained value to the critical values. We reject the null hypothesis if the obtained value exceeds a critical value. Figure 8.5 shows that the obtained value (Zobt = 1.94) is less than the critical value; it does not fall in the rejection region. The decision is to retain the null hypothesis.
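
The two-tailed decision in Step 4 can be checked numerically. The following is a minimal sketch in Python using only the standard library; the function name `one_sample_z_test` is illustrative and not from the text. It uses the Example 8.1 values (M = 585, µ = 558, σ = 139, n = 100) and also reports the two-tailed p value. Because the code does not round z to 1.94 before looking up the probability, its p value differs slightly from a value read off a printed unit normal table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu, sigma, n, alpha=0.05):
    """Two-tailed one-independent sample z test (illustrative helper)."""
    standard_error = sigma / sqrt(n)                  # sigma_M = sigma / sqrt(n)
    z_obt = (sample_mean - mu) / standard_error       # obtained value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # about 1.96 when alpha = .05
    p_value = 2 * (1 - NormalDist().cdf(abs(z_obt)))  # probability in both tails
    decision = "reject" if abs(z_obt) > z_crit else "retain"
    return z_obt, z_crit, p_value, decision


z_obt, z_crit, p_value, decision = one_sample_z_test(585, 558, 139, 100)
print(f"z_obt = {z_obt:.2f}, critical values = ±{z_crit:.2f}")
print(f"p = {p_value:.4f}; decision: {decision} the null hypothesis")
```

The obtained value stays inside the critical values, so the sketch reaches the same decision as the text: retain the null hypothesis.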

FIGURE 8.5 [Figure: normal distribution with z scores from −3 to 3 and the null at 0; rejection regions of α = .0250 in each tail beyond the critical values of ±1.96, with the obtained value 1.94 in the "retain the null hypothesis" region. Annotation: The obtained value is 1.94, which fails to reach the cutoff for the rejection region; retain the null hypothesis.] Since the obtained value fails to reach the rejection region (it is within the critical values of ±1.96), we decide to retain the null hypothesis.

The probability of obtaining Zobt = 1.94 is stated by the p value. To locate the p value or probability of obtaining the z statistic, we refer to the unit normal table in Table B1 in Appendix B. Look for a z score equal to 1.94 in column A, then locate the probability toward the tail in column C. The value is .0262. Finally, multiply the value given in column C times the number of tails for alpha. Since this is a two-tailed test, we multiply .0262 times 2: p = (.0262) × 2 tails = .0524. Table 8.5 summarizes how to determine the p value for one- and two-tailed tests. (We will compute one-tailed tests in Examples 8.2 and 8.3.)

TABLE 8.5 To find the p value for the z statistic, find its probability (toward the tail) in the unit normal table and multiply this probability times the number of tails for alpha.

                       One-Tailed Test    Two-Tailed Test
Number of tails        1                  2
Probability            p                  p
p value calculation    1p                 2p

We found in Example 8.1 that if the null hypothesis were true, then p = .0524 that we could have selected this sample mean from this population. The criterion we set in Step 2 was that the probability must be less than 5% that we obtain a sample mean, if the null hypothesis were true. Since p is greater than 5%, we decide to retain the null hypothesis. We conclude that the mean score on the GRE General Test in this population is 558 (the value stated in the null hypothesis).

DEFINITION: A Type III error occurs with one-tailed tests, where the researcher decides to retain the null hypothesis because the rejection region was located in the wrong tail. The “wrong tail” refers to the opposite tail from where a difference was observed and would have otherwise been significant.

8.6 RESEARCH IN FOCUS: DIRECTIONAL VERSUS NONDIRECTIONAL TESTS

Kruger and Savitsky (2006) conducted a study in which they performed two tests on the same data. They completed an upper-tail critical test at α = .05 and a two-tailed test at α = .10. As shown in Figure 8.10, these are similar tests, except in the upper-tail test, all the alpha level is placed in the upper tail, and in the two-tailed test, the alpha level is split so that .05 is placed in each tail. When the researchers showed these results to a group of participants, they found that participants were more persuaded by a significant result when it was described as a one-tailed test, p < .05, than when it was described as a two-tailed test, p < .10. This was interesting because the two results were identical: both tests were associated with the same critical value in the upper tail.

Most editors of peer-reviewed journals in behavioral research will not publish the results of a study where the level of significance is greater than .05. Although the two-tailed test, p < .10, was significant, it is unlikely that the results would be published in a peer-reviewed scientific journal. Reporting the same results as a one-tailed test, p < .05, makes it more likely that the data will be published.

FIGURE 8.10 [Figure: two normal distributions with z scores from −3 to 3. Left panel: upper-tail critical test at a .05 level of significance, critical value z = 1.645. Right panel: two-tailed test at a .10 level of significance, critical values z = −1.645 and z = 1.645. The upper critical value is the same for both tests.] When α = .05, all of that value is placed in the upper tail for an upper-tail critical test. The two-tailed equivalent would require a test with α = .10, such that .05 is placed in each tail.

Class 1      Class 2
M1 = 40      M2 = 40
µ1 = 38      µ2 = 38
σ1 = 10      σ2 = 2

Effect size for Class 1: d = (M − µ)/σ = (40 − 38)/10 = 0.20.

Effect size for Class 2: d = (M − µ)/σ = (40 − 38)/2 = 1.00.

The numerator for each effect size estimate is the same. The mean difference between the sample mean and the population mean is 2 points. Although there is a 2-point effect in both Class 1 and Class 2, Class 2 is associated with a much larger effect size in the population because the standard deviation is smaller. Since a larger effect size is associated with greater power, we should find that it is easier to detect the 2-point effect in Class 2. To determine whether this is true, suppose we select a sample of 30 students (n = 30) from each class and measure the same sample mean value that is listed in Table 8.8. Let’s determine the power of each test when we conduct an upper-tail critical test at a .05 level of significance.

To determine the power, we will first construct the sampling distribution for each class, with a mean equal to the population mean and standard error equal to σ/√n:

Sampling distribution for Class 1: Mean: µM = 38. Standard error: σ/√n = 10/√30 = 1.82.

Sampling distribution for Class 2: Mean: µM = 38. Standard error: σ/√n = 2/√30 = 0.37.

If the null hypothesis is true, then the sampling distribution of the mean for alpha (α), the type of error associated with a true null hypothesis, will have a mean equal to 38. We can now determine the smallest value of the sample mean that is the cutoff for the rejection region, where we decide to reject that the true population mean is 38. For an upper-tail critical test using a .05 level of significance, the critical

value is 1.645. We can use this value to compute a z transformation to determine what sample mean value is 1.645 standard deviations above 38 in a sampling distribution for samples of size 30:

Cutoff for α (Class 1): 1.645 = (M − 38)/1.82
M = 40.99

Cutoff for α (Class 2): 1.645 = (M − 38)/0.37
M = 38.61

precisely, |Z| > 1.96)” (p. 464). Based on this description:
a. Are the authors referring to critical values for a one- or two-tailed z test?
b. What alpha level are the authors referring to?

33. Sample size and power. Collins and Morris (2008) simulated selecting thousands of samples and analyzed the results using many different test statistics. With regard to the power for these samples, they reported that “generally speaking, all tests became more powerful as sample size increased” (p. 468). How did increasing the sample size in this study increase power?

34. Describing hypothesis testing. Blouin and Riopelle (2004) made the following statement concerning how scientists select test statistics: “[This] test is the norm for conducting a test of H0, when . . . the population(s) are normal with known variance(s)” (p. 78). Based on this description, what test statistic are they describing as the norm? How do you know this?
200
FIGURE 8.14 [Figure: mean score (y-axis: Score) for the General Population and for Gifted Students (x-axis: Population and Sample).] The mean Graduate Record Examination (GRE) General Test scores among a sample of gifted students compared with the general population. Error bars indicate SEM.

In two sentences and a figure, we reported the value of the test statistic, p value, effect size, and the mean test scores. The error bars indicate the standard error of the mean for this study.

If we obtain a sample mean equal to 40.99 or higher in Class 1, then we will reject the null hypothesis. If we obtain a sample mean equal to 38.61 or higher in Class 2, then we will reject the null hypothesis. To determine the power for this test, we assume that the sample mean we selected (M = 40) is the true population mean; we are therefore assuming that the null hypothesis is false. We are asking the following question: If we are correct and there is a 2-point effect, then what is the probability that we will detect the effect? In other words, what is the probability that a sample randomly selected from this population will lead to a decision to reject the null hypothesis?

If the null hypothesis is false, then the sampling distribution of the mean for β, the type of error associated with a false null hypothesis, will have a mean equal to 40. This is what we believe is the true population mean, and this is the only change; we do not change the standard error. Figure 8.12 shows the sampling distribution for Class 1, and Figure 8.13 shows the sampling distribution for Class 2, assuming the null hypothesis is correct (top graph) and assuming the 2-point effect exists (bottom graph).

NOTE: As the size of an effect increases, the power to detect the effect also increases.

FIGURE 8.12 Small effect size and low power for Class 1. [Figure: two sampling distributions. Top graph, assuming the null is true: µ = 38, SEM = 1.82, n = 30; sample means in the rejection region have less than a 5% chance of occurrence, if the null is true (probability of a Type I error = .05). Bottom graph, assuming the null is false with a 2-point effect: µ = 40, SEM = 1.82, n = 30; about 29% of sample means selected from this population will result in a decision to reject the null, if the null is false (power = .2946). The cutoff, 40.99, is marked.] In this example, when alpha is .05, the critical value or cutoff for alpha is 40.99. When α = .05, notice that only about 29% of samples will detect this effect (the power). So even if the researcher is correct, and the null is false (with a 2-point effect), only about 29% of the samples he or she selects at random will result in a decision to reject the null hypothesis.
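
The cutoff and power values for the two classes can be reproduced with a short calculation. This is a minimal sketch in Python (standard library only); the helper names `cohens_d` and `upper_tail_power` are illustrative, not from the text. The code keeps the unrounded standard error, so its results differ slightly from the hand calculations, which round the standard error to 1.82 and 0.37 and read the probability from a unit normal table.

```python
from math import sqrt
from statistics import NormalDist


def cohens_d(sample_mean, mu, sigma):
    """Effect size: mean difference in population standard deviation units."""
    return (sample_mean - mu) / sigma


def upper_tail_power(mu_null, mu_true, sigma, n, alpha=0.05):
    """Cutoff sample mean and power for an upper-tail critical z test."""
    se = sigma / sqrt(n)                         # standard error of the mean
    z_crit = NormalDist().inv_cdf(1 - alpha)     # about 1.645 when alpha = .05
    cutoff = mu_null + z_crit * se               # smallest mean that rejects the null
    # Power: probability of a sample mean at or beyond the cutoff when the
    # sampling distribution is centered on the assumed true mean.
    power = 1 - NormalDist(mu_true, se).cdf(cutoff)
    return cutoff, power


for label, sigma in (("Class 1", 10), ("Class 2", 2)):
    d = cohens_d(40, 38, sigma)
    cutoff, power = upper_tail_power(mu_null=38, mu_true=40, sigma=sigma, n=30)
    print(f"{label}: d = {d:.2f}, cutoff = {cutoff:.2f}, power = {power:.4f}")
```

Only σ differs between the two calls, which is why the same 2-point effect is hard to detect in Class 1 and almost certain to be detected in Class 2.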


FIGURE 8.13 Large effect size and high power for Class 2. [Figure: two sampling distributions. Top graph, assuming the null is true: µ = 38, SEM = 0.37, n = 30; sample means in the rejection region have less than a 5% chance of occurrence, if the null is true (probability of a Type I error = .05). Bottom graph, assuming the null is false with a 2-point effect: µ = 40, SEM = 0.37, n = 30; almost 100% of sample means collected from this population will result in a decision to reject the null, if the null is false (power = .9999). The cutoff, 38.61, is marked.] In this example, when alpha is .05, the critical value or cutoff for alpha is 38.61. When α = .05, notice that practically any sample will detect this effect (the power). So if the researcher is correct, and the null is false (with a 2-point effect), nearly 100% of the samples he or she selects at random will result in a decision to reject the null hypothesis.

If we are correct, and the 2-point effect exists, then we are much more likely to detect the effect in Class 2 for n = 30. Class 1 has a small effect size (d = .20). Even if we are correct, and a 2-point effect does exist in this population, then of all the samples of size 30 we could select from this population, only about 29% (power = .2946) of those samples will show the effect (i.e., lead to a decision to reject the null). The probability of correctly rejecting the null hypothesis (power) is low.

Class 2 has a large effect size (d = 1.00). If we are correct, and a 2-point effect does exist in this population, then of all the samples of size 30 we could select from this population, nearly 100% (power = .9999) of those samples will show the effect (i.e., lead to a decision to reject the null hypothesis). Hence, we have more power to detect an effect in this population, and correctly reject the null hypothesis.

CHAPTER SUMMARY ORGANIZED BY LEARNING OBJECTIVE

LO 1: Identify the four steps of hypothesis testing.

• Hypothesis testing, or significance testing, is a method of testing a claim or hypothesis about a parameter in a population, using data measured in a sample. In this method, we test some hypothesis by determining the likelihood that a sample statistic could have been selected, if the hypothesis regarding the population parameter were true. The four steps of hypothesis testing are as follows:
– Step 1: State the hypotheses.
– Step 2: Set the criteria for a decision.
– Step 3: Compute the test statistic.
– Step 4: Make a decision.

LO 2: Define null hypothesis, alternative hypothesis, level of significance, test statistic, p value, and statistical significance.

• The null hypothesis (H0), stated as the null, is a statement about a population parameter, such as the population mean, that is assumed to be true.
• An alternative hypothesis (H1) is a statement that directly contradicts a null hypothesis by stating that the actual value of a population parameter, such as the mean, is less than, greater than, or not equal to the value stated in the null hypothesis.
• Level of significance refers to a criterion of judgment upon which a decision is made regarding the value stated in a null hypothesis.
• The test statistic is a mathematical formula that allows researchers to determine the likelihood or probability of obtaining sample outcomes if the null hypothesis were true. The value of a test statistic can be used to make inferences concerning the value of population parameters stated in the null hypothesis.
• A p value is the probability of obtaining a sample outcome, given that the value stated in the null hypothesis is true. The p value of a sample outcome is compared to the level of significance.
• Significance, or statistical significance, describes a decision made concerning a value stated in the null hypothesis. When a null hypothesis is rejected, a result is significant. When a null hypothesis is retained, a result is not significant.

LO 3: Define Type I error and Type II error, and identify the type of error that researchers control.

• We can decide to retain or reject the null hypothesis, and this decision can be correct or incorrect. Two types of errors in hypothesis testing are called Type I and Type II errors.
• A Type I error is the probability of rejecting a null hypothesis that is actually true. The probability of this type of error is determined by the researcher and stated as the level of significance or alpha level for a hypothesis test.
• A Type II error is the probability of retaining a null hypothesis that is actually false.

LO 4: Calculate the one–independent sample z test and interpret the results.

• The one–independent sample z test is a statistical procedure used to test hypotheses concerning the mean in a single population with a known variance. The test statistic for this hypothesis test is

zobt = (M − µ)/σM, where σM = σ/√n.

• Critical values, which mark the cutoffs for the rejection region, can be identified for any level of significance. The value of the test statistic is compared to the critical values. When the value of a test statistic exceeds a critical value, we reject the null hypothesis; otherwise, we retain the null hypothesis.

LO 5: Distinguish between a one-tailed and two-tailed test, and explain why a Type III error is possible only with one-tailed tests.

• Nondirectional (two-tailed) tests are hypothesis tests where the alternative hypothesis is stated as not equal to (≠). So we are interested in any alternative from the null hypothesis.
• Directional (one-tailed) tests are hypothesis tests where the alternative hypothesis is

APPENDIX C
Chapter Solutions for Even-Numbered End-of-Chapter Problems

CHAPTER 8

2. Reject the null hypothesis and retain the null hypothesis.

4. A Type II error is the probability of retaining a null hypothesis that is actually false.

6. Critical values = ±1.96.

8. All four terms describe the same thing. The level of significance is represented by alpha, which defines the rejection region or the region associated with the probability of committing a Type I error.

10. Alpha level, sample size, and effect size.

12. In hypothesis testing, the significance of an effect determines whether an effect exists in some population. Effect size is used as a measure for how big the effect is in the population.

14. All decisions are made about the null hypothesis and not the alternative hypothesis. The only appropriate decisions are to retain or reject the null hypothesis.

16. The sample size in the second sample was larger. Therefore, the second sample had more power to detect the effect, which is likely why the decisions were different.

18.
a. α = .05.
b. α = .01.
c. α = .001.

20.
1a. Reject the null hypothesis.
1b. Reject the null hypothesis.
1c. Reject the null hypothesis.
1d. Retain the null hypothesis.
2a. Retain the null hypothesis.
2b. Retain the null hypothesis.
2c. Reject the null hypothesis.
2d. Reject the null hypothesis.

22.
a. σM = 7/√49 = 1.0; hence, zobt = (74 − 72)/1 = 2.00. The decision is to reject the null hypothesis.
b. d = (74 − 72)/7 = .29. A medium effect size.

24.
a. d = 0.05/0.4 = 0.125. A small effect size.
b. d = 0.1/0.4 = 0.25. A medium effect size.
c. d = 0.4/0.4 = 1.00. A large effect size.

THE RELATIONSHIP BETWEEN SAMPLE SIZE AND POWER

To overcome low effect size, we can increase the sample size. Increasing sample size decreases standard error, thereby increasing power. To illustrate, let’s compute the test statistic for the one-tailed significance test for Class 1, which had a small effect size. The data for Class 1 are given in Table 8.8 for a sample of 30 participants. The test statistic for Class 1 when n = 30 is:

zobt = (M − µ)/(σ/√n) = (40 − 38)/(10/√30) = 1.10.

For a one-tailed test that is upper-tail critical, the critical value is 1.645. The value of the test statistic (+1.10) does not exceed the critical value (+1.645), so we retain the null hypothesis.

Increase the sample size to n = 100. The test statistic for Class 1 when n = 100 is:

zobt = (M − µ)/(σ/√n) = (40 − 38)/(10/√100) = 2.00.

NOTE: Increasing the sample size increases power by reducing the standard error, thereby increasing the value of the test statistic in hypothesis testing.
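
The effect of sample size on the Class 1 test statistic can be sketched in a few lines of Python (standard library only; the function names are illustrative, not from the text). The sketch compares each obtained value to the upper-tail critical value of 1.645. It also computes the power at each sample size, assuming the true mean is 40; the power at n = 100 is not reported in the text and is shown here only as an illustration under that assumption.

```python
from math import sqrt
from statistics import NormalDist

Z_CRIT = 1.645  # upper-tail critical value at a .05 level of significance


def z_obtained(sample_mean, mu, sigma, n):
    """One-independent sample z statistic."""
    return (sample_mean - mu) / (sigma / sqrt(n))


def upper_tail_power(mu_null, mu_true, sigma, n, z_crit=Z_CRIT):
    """Probability of rejecting the null when the true mean is mu_true."""
    se = sigma / sqrt(n)
    cutoff = mu_null + z_crit * se
    return 1 - NormalDist(mu_true, se).cdf(cutoff)


for n in (30, 100):
    z = z_obtained(40, 38, 10, n)
    decision = "reject" if z > Z_CRIT else "retain"
    power = upper_tail_power(38, 40, 10, n)
    print(f"n = {n}: z_obt = {z:.2f}, decision = {decision}, power = {power:.2f}")
```

With the same mean difference and standard deviation, the larger sample shrinks the standard error, which raises both the test statistic and the power.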


The critical value is still 1.645. The value of the test statistic (+2.00) now exceeds the critical value (+1.645), so we reject the null hypothesis.

Notice that increasing the sample size alone led to a decision to reject the null hypothesis. Hence, increasing sample size increases power: It makes it more likely that we will detect an effect, assuming that an effect exists in some population.

LEARNING CHECK 7
1. As effect size increases, what happens to the power?
2. As effect size decreases, what happens to the power?
3. When a population is associated with a small effect size, what can a researcher do to increase the power of the study?
4. True or false: The effect size, power, and sample size associated with a study can affect the decisions we make in hypothesis testing.
Answers: 1. Power increases; 2. Power decreases; 3. Increase the sample size (n); 4. True.

stated as greater than (>) or less than (<) some value. So we are interested in a specific alternative from the null hypothesis.
• A Type III error occurs for one-tailed tests where a result would have been significant in one tail, but the researcher retains the null hypothesis because the rejection region was placed in the wrong or opposite tail.

LO 6: Explain what effect size measures and compute a Cohen’s d for the one–independent sample z test.

• Effect size is a statistical measure of the size of an observed effect in a population, which allows researchers to describe how far scores shifted in the population, or the percent of variance that can be explained by a given variable.
• Cohen’s d is used to measure how far scores shifted in a population and is computed using the following formula:

Cohen’s d = (M − µ)/σ.

• To interpret the size of an effect, we refer to Cohen’s effect size conventions, which are standard rules for identifying small, medium, and large effects based on typical findings in behavioral research.

LO 7: Define power and identify six factors that influence power.

• The power in hypothesis testing is the probability that a randomly selected sample will show that the null hypothesis is false when the null hypothesis is in fact false.
• To increase the power of detecting an effect in a given population:
a. Increase effect size (d), sample size (n), and alpha (α).
b. Decrease beta error (β), population standard deviation (σ), and standard error (σM).

APA LO 8: Summarize the results of a one–independent sample z test in American Psychological Association (APA) format.

• To report the results of a z test, we report the test statistic, p value, and effect size of a hypothesis test. In addition, a figure or table is usually provided to summarize the means and standard error or standard deviation measured in a study.

KEY TERMS

alpha (α)
alternative hypothesis (H1)
beta (β) error
Cohen’s d
Cohen’s effect size conventions
critical values
directional (one-tailed) tests
effect
effect size
hypothesis
hypothesis testing
level of significance
nondirectional (two-tailed) tests
null
null hypothesis (H0)
obtained value
one–independent sample z test
power
p value
rejection region
significance
significance testing
statistical significance
test statistic
Type I error
Type II error
Type III error
z statistic

END-OF-CHAPTER PROBLEMS

Factual Problems

1. State the four steps of hypothesis testing.
2. What are two decisions that a researcher makes in hypothesis testing?
3. What is a Type I error (α)?
4. What is a Type II error (β)?
5. What is the power in hypothesis testing?
6. What are the critical values for a one–independent sample nondirectional (two-tailed) z test at a .05 level of significance?

26.
a. d = 1/1 = 1.00. Large effect size.
b. d = 1/2 = 0.50. Medium effect size.
c. d = 1/4 = 0.25. Medium effect size.
d. d = 1/6 = .17. Small effect size.

28. This will decrease standard error, thereby increasing power.

30. The point Good and Hardin (2003) are making is that it is possible with the same data to retain the null for a two-tailed test and reject the null for a one-tailed test where the entire rejection region is placed in a single tail.

32.
a. Two-tailed z test.
b. α = .05.

34. We would use the z test because the population variance is known.

8.9 ADDITIONAL FACTORS THAT INCREASE POWER

The power is the likelihood of detecting an effect. Behavioral research often requires a great deal of time and money to select, observe, measure, and analyze data. And the institutions that supply the funding for research studies want to know that they are spending their money wisely and that researchers conduct studies that will show results. Consequently, to receive a research grant, researchers are often required to state the likelihood that they will detect the effect they are studying, assuming they are correct. In other words, researchers must disclose the power of their study.

The typical standard for power is .80. Researchers try to make sure that at least 80% of the samples they select will show an effect when an effect exists in a population. In Section 8.8, we showed that increasing effect size and sample size increases power. In this section, we introduce four additional factors that influence power.

INCREASING POWER: INCREASE EFFECT SIZE, SAMPLE SIZE, AND ALPHA

NOTE: To increase power: increase effect size, sample size, and alpha; decrease beta, population standard deviation, and standard error.

Increasing effect size, sample size, and the alpha level will increase power. Section 8.8 showed that increasing effect size and sample size increases power; here we discuss increasing alpha. The alpha level is the probability of a Type I error; it is the rejection region for a hypothesis test. The larger the rejection region, the greater the likelihood of rejecting the null hypothesis, and the greater the power will be. This was illustrated by the difference in the decisions made for Examples 8.1 and 8.2. Increasing the size of the rejection region in the upper tail in Example 8.2 increased the power to detect the 27-point effect. This is why one-tailed tests are more powerful than two-tailed tests: They increase alpha in


the direction that an effect is expected to occur, thereby increasing the power to detect an effect.

7. Explain why a one-tailed test is associated with greater power than a two-tailed test.

8. How are the rejection region, probability of a Type I error, level of significance, and alpha level related?

9. Alpha (α) is used to measure the error for decisions concerning true null hypotheses. What is beta (β) error used to measure?

10. What three factors can be increased to increase power?

11. What three factors can be decreased to increase power?

12. Distinguish between the significance of a result and the size of an effect.

Concepts and Application Problems

13. Explain why the following statement is true: The population standard deviation is always larger than the standard error when the sample size is greater than one (n > 1).

14. A researcher conducts a hypothesis test and concludes that his hypothesis is correct. Explain why this conclusion is never an appropriate decision in hypothesis testing.

15. The weight (in pounds) for a population of school-aged children is normally distributed with a mean equal to 135 ± 20 pounds (µ ± σ). Suppose we select a sample of 100 children (n = 100) to test whether children in this population are gaining weight at a .05 level of significance.
a. What are the null and alternative hypotheses?
b. What is the critical value for this test?
c. What is the mean of the sampling distribution?
d. What is the standard error of the mean for the sampling distribution?

16. A researcher selects a sample of 30 participants and makes the decision to retain the null hypothesis. She conducts the same study testing the same hypothesis with a sample of 300 participants and makes the decision to reject the null hypothesis. Give a likely explanation for why the two samples led to different decisions.

17. A researcher conducts a one–independent sample z test and makes the decision to reject the null hypothesis. Another researcher selects a larger sample from the same population, obtains the same sample mean, and makes the decision to retain the null hypothesis using the same hypothesis test. Is this possible? Explain.

18. Determine the level of significance for a hypothesis test in each of the following populations given the specified standard error and critical values. Hint: Refer to the values given in Table 8.4:
a. µ = 100, σM = 8, critical values: 84.32 and 115.68
b. µ = 100, σM = 6, critical value: 113.98
c. µ = 100, σM = 4, critical value: 86.8

19. For each p value stated below: (1) What is the decision for each if α = .05? (2) What is the decision for each if α = .01?
a. p = .1000
b. p = .0250
c. p = .0050
d. p = .0001

20. For each obtained value stated below: (1) What is the decision for each if α = .05 (one-tailed test, upper-tail critical)? (2) What is the decision for each if α = .01 (two-tailed test)?
a. zobt = 2.10
b. zobt = 1.70
c. zobt = 2.75
d. zobt = –3.30

21. Will each of the following increase, decrease, or have no effect on the value of a test statistic for the one–independent sample z test?
a. The sample size is increased.
b. The population variance is decreased.
c. The sample variance is doubled.
d. The difference between the sample mean and population mean is decreased.

22. The police chief selects a sample of 49 local police officers from a population of officers with a mean physical fitness rating of 72 ± 7.0 (µ ± σ) on a 100-point physical endurance rating scale. He measures a sample mean physical fitness rating on

Data Pre-processing in Machine learning

Acknowledgements:
Dr Vijay Kumar, IT Dept, NIT Jalandhar

INCREASING POWER: DECREASE BETA, STANDARD DEVIATION (σ), AND STANDARD ERROR

Decreasing three factors can increase power. Decreasing beta error (β) increases power. In Table 8.3, β is given as the probability of a Type II error, and 1 − β is given as the power. So the lower β is, the greater the solution will be for 1 − β. For example, say β = .20. In this case, 1 − β = (1 − .20) = .80. If we decreased β, say, to β = .10, the power will increase: 1 − β = (1 − .10) = .90. Hence, decreasing beta error increases power.

Decreasing the population standard deviation (σ) and standard error (σM) will also increase power. The population standard deviation is the numerator for computing standard error. Decreasing the population standard deviation will decrease the standard error, thereby increasing the value of the test statistic. To illustrate, suppose that we select a sample from a population of students with quiz scores equal to 10 ± 8 (µ ± σ). We select a sample of 16 students from this population and measure a sample mean equal to 12. In this example, the standard error is:

σM = σ/√n = 8/√16 = 2.0.

To compute the z statistic, we subtract the population mean from the sample mean and divide by the standard error:

zobt = (M − µ)/σM = (12 − 10)/2 = 1.00.

An obtained value equal to 1.00 does not exceed the critical value for a one-tailed test (critical value = 1.645) or a two-tailed test (critical values = ±1.96). The decision is to retain the null hypothesis.

If the population standard deviation is smaller, the standard error will be smaller, thereby making the value of the test statistic larger. Suppose, for example, that we reduce the population standard deviation to 4. The standard error in this example is now:

σM = σ/√n = 4/√16 = 1.0.

To compute the z statistic, we subtract the population mean from the sample mean and divide by this smaller standard error:

zobt = (M − µ)/σM = (12 − 10)/1 = 2.00.

An obtained value equal to 2.00 does exceed the critical value for a one-tailed test (critical value = 1.645) and a two-tailed test (critical values = ±1.96). Now the decision is to reject the null hypothesis. Assuming that an effect exists in the population, decreasing the population standard deviation decreases standard error and increases the power to detect an effect. Table 8.9 lists each factor that increases power.
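
The quiz-score illustration (µ = 10, M = 12, n = 16, with σ = 8 and then σ = 4) can be checked with a short Python sketch. The function names below are illustrative, not from the text, and the critical values 1.645 and 1.96 are the ones quoted above for a .05 level of significance.

```python
from math import sqrt

ONE_TAILED_CRIT = 1.645  # upper-tail critical value at alpha = .05
TWO_TAILED_CRIT = 1.96   # critical values are ±1.96 at alpha = .05


def z_obtained(sample_mean, mu, sigma, n):
    """Subtract the population mean from the sample mean, divide by sigma_M."""
    standard_error = sigma / sqrt(n)
    return (sample_mean - mu) / standard_error


def decisions(z):
    """Decision about the null hypothesis for each type of test."""
    return {
        "one_tailed": "reject" if z > ONE_TAILED_CRIT else "retain",
        "two_tailed": "reject" if abs(z) > TWO_TAILED_CRIT else "retain",
    }


for sigma in (8, 4):
    z = z_obtained(12, 10, sigma, 16)
    print(f"sigma = {sigma}: z_obt = {z:.2f}, decisions = {decisions(z)}")
```

Halving σ halves the standard error and doubles the obtained value, which is enough to move the decision from retain to reject for both tests.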


TABLE 8.9 A summary of factors that increase power (the probability of rejecting a false null hypothesis).

To increase power:

Increase            Decrease
d (Effect size)     β (Type II error)
n (Sample size)     σ (Standard deviation)
α (Type I error)    σM (Standard error)

this scale equal to 74. He conducts a one–independent sample z test to determine whether physical endurance increased at a .05 level of significance.
a. State the value of the test statistic and whether to retain or reject the null hypothesis.
b. Compute effect size using Cohen’s d.

23. A cheerleading squad received a mean rating (out of 100 possible points) of 75 ± 12 (µ ± σ) in competitions over the previous three seasons. The same cheerleading squad performed in 36 local competitions this season with a mean rating equal to 78 in competitions. Suppose we conduct a one–independent sample z test to determine whether mean ratings increased this season (compared to the previous three seasons) at a .05 level of significance.

27. As α increases, so does the power to detect an effect. Why, then, do we restrict α from being larger than .05?

28. Will increasing sample size (n) and decreasing the population standard deviation (σ) increase or decrease the value of standard error? Will this increase or decrease power?

Problems in Research

29. Directional vs. nondirectional hypothesis testing. In an article reviewing directional and nondirectional tests, Leventhal (1999) stated the following hypotheses concerning the difference between two population means.

Topics

• Need for data pre-processing

8.10 SPSS IN FOCUS: A PREVIEW FOR CHAPTERS 9 TO 18
As discussed in Section 8.5, it is rare that we know the value of the population
a. State the value of the test statistic and whether A B
variance, so the z test is not a common hypothesis test. It is so uncommon that
to retain or reject the null hypothesis.

• What is data pre-processing


SPSS can’t be used to compute this test statistic, although it can be used to µ1 – µ2 = 0 µ1 – µ2 = 0
compute all other test statistics described in this book. For each analysis, SPSS b. Compute effect size using Cohen’s d.
provides output analyses that indicate the significance of a hypothesis test and µ1 – µ2 > 0 µ1 – µ2 ≠ 0
24. A local school reports that its average GPA is
provide the information needed to compute effect size and even power. SPSS
2.66 ± 0.40 (µ ± σ). The school announces that it
statistical software can be used to compute nearly any statistic or measure used in
will be introducing a new program designed to
behavioral research. For this reason, most researchers use SPSS software to analyze

• Data Pre-processing tasks


improve GPA scores at the school. What is the a. Which did he identify as nondirectional?
their data.
effect size (d) for this program if it is expected to b. Which did he identify as directional?
improve GPA by:
30. The one-tailed tests. In their book, Common
a. .05 points?
8.11 APA IN FOCUS: REPORTING THE b. .10 points?
Errors in Statistics (and How to Avoid Them), Good
and Hardin (2003) wrote, “No one will know
TEST STATISTIC AND EFFECT SIZE c. .40 points? whether your [one-tailed] hypothesis was con-
ceived before you started or only after you’d
To report the results of a z test, we report the test statistic, p value, and effect size of 25. Will each of the following increase, decrease, or examined the data” (p. 347). Why do the
a hypothesis test. Here is how we could report the significant result for the z statistic have no effect on the value of Cohen’s d? authors state this as a concern for one-tailed
in Example 8.2: a. The sample size is decreased. tests?
b. The population variance is increased.
Test scores for students in the elite school were significantly higher than 31. The hopes of a researcher. Hayne Reese
the standard performance of test takers, z = 1.94, p < .03. c. The sample variance is reduced. (1999) wrote, “The standard method of statistical
d. The difference between the sample and popu- inference involves testing a null hypothesis that
Notice that when we report a result, we do not state that we reject or retain the lation mean is increased. the researcher usually hopes to reject” (p. 39).
null hypothesis. Instead, we report whether a result is significant (the decision was Why does the researcher usually hope to reject
to reject the null hypothesis) or not significant (the decision was to retain the null 26. State whether the effect size for a 1-point effect the null hypothesis?
hypothesis). Also, you are not required to report the exact p value, although it is (M – µ = 1) is small, medium, or large given the
following population variances: 32. Describing the z test. In an article describing
recommended. An alternative is to report it in terms of the closest value to the
a. σ = 1 hypothesis testing with small sample sizes,
hundredths or thousandths place that its value is less than. In this example, we
Collins and Morris (2008) provided the following
stated p < .03 for a p value actually equal to .0262. b. σ = 2
description for a z test: “Z is considered signifi-
Finally, it is often necessary to include a figure or table to illustrate a significant c. σ = 4 cant if the difference is more than roughly two
effect and the effect size associated with it. For example, we could describe the effect
d. σ = 6 standard deviations above or below zero (or more
size in one additional sentence supported by the following figure:

As shown in Figure 8.14, students in the elite school scored an average of 27


points higher on the exam compared to the general population (d = .19).
Need for data Pre-processing
“data data every where”

Data pre-processing tasks
Major data pre-processing tasks
• Data cleaning
• Data integration
• Data transformation
• Data reduction

Data Cleaning (Missing values)
Removing the data entity: The easiest approach is to drop every record that has a missing value. This is usually discouraged because it leads to loss of data: the removed data entities or feature values could have added value to the data set.

Before:
Month  FFMC  DC     temp  RH   wind
mar    86.2  94.3   8.2   51   6.7
oct    90.6  669.1  18    33   0.9
oct    90.6  686.9  NaN   33   NaN
mar    NaN   77.5   8.3   97   4
mar    89.3  102.2  11.4  99   1.8
aug    92.3  NaN    22.2  NaN  NaN
aug    NaN   495.6  24.1  27   NaN
aug    91.5  608.2  8     86   2.2
sep    91    692.6  NaN   63   5.4
sep    92.5  698.6  22.8  40   4

After:
Month  FFMC  DC     temp  RH   wind
mar    86.2  94.3   8.2   51   6.7
oct    90.6  669.1  18    33   0.9
mar    89.3  102.2  11.4  99   1.8
aug    91.5  608.2  8     86   2.2
sep    92.5  698.6  22.8  40   4
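
The removal approach above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the names `ROWS`, `is_missing`, and `drop_incomplete` are assumptions made for this sketch, and `math.nan` stands in for the NaN entries of the sample table.

```python
import math

NAN = math.nan

# Sample forest-fire rows from the slide: (Month, FFMC, DC, temp, RH, wind).
ROWS = [
    ("mar", 86.2, 94.3, 8.2, 51, 6.7),
    ("oct", 90.6, 669.1, 18, 33, 0.9),
    ("oct", 90.6, 686.9, NAN, 33, NAN),
    ("mar", NAN, 77.5, 8.3, 97, 4),
    ("mar", 89.3, 102.2, 11.4, 99, 1.8),
    ("aug", 92.3, NAN, 22.2, NAN, NAN),
    ("aug", NAN, 495.6, 24.1, 27, NAN),
    ("aug", 91.5, 608.2, 8, 86, 2.2),
    ("sep", 91, 692.6, NAN, 63, 5.4),
    ("sep", 92.5, 698.6, 22.8, 40, 4),
]


def is_missing(value):
    """A value counts as missing when it is a float NaN."""
    return isinstance(value, float) and math.isnan(value)


def drop_incomplete(rows):
    """Keep only the rows in which no field is missing."""
    return [row for row in rows if not any(is_missing(v) for v in row)]


complete = drop_incomplete(ROWS)
for row in complete:
    print(row)
```

Half of the ten records are discarded here, which illustrates why the slide discourages this approach.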

Need for data Pre-processing?

Data Cleaning
Data cleaning: it is a procedure to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving data inconsistencies.

Data cleaning tasks
• Fill missing values
• Noise smoothing and outlier detection
• Resolving inconsistencies

Data Cleaning (Missing values)
Manually filling in the values: This approach is time consuming, and not recommended for huge data sets.

Applied to the sample forest fires dataset:
Month  FFMC  DC     temp  RH   wind
mar    86.2  94.3   8.2   51   6.7
oct    90.6  669.1  18    33   0.9
oct    90.6  686.9  17    33   0.8
mar    91.6  77.5   8.3   97   4
mar    89.3  102.2  11.4  99   1.8
aug    92.3  380    22.2  92   1.8
aug    90    495.6  24.1  27   2
aug    91.5  608.2  8     86   2.2
sep    91    692.6  22    63   5.4
sep    92.5  698.6  22.8  40   4

Need for data Pre-processing
Data in the real world is “quite messy”
• incomplete: missing feature values, absence of a crucial feature, or containing only aggregate data.
  e.g. Height=“ ”
• noisy: containing errors or outliers
  e.g. Weight=“5000” or “-60”
• inconsistent: containing discrepancies in feature values.
  e.g. Age=“20” and dob=“12 july 1990”
  e.g. contradictions between duplicate records

Data Cleaning
Missing values: data values are not available.
e.g. many data entities have no value for a certain feature, like a missing BMI value for a person in a diabetes dataset.

Probable reasons for missing values:
• faulty measuring equipment
• reluctance of a person to share certain details
• negligence on the part of the data entry operator
• feature considered unimportant at the time of data collection

Data Cleaning (Missing values)
Central tendency technique: The mean, median, or mode of a feature is used to calculate the value that replaces its missing values.
Mean: the arithmetic average of the observed values of the feature.
Median: the middle value of the observed values of the feature once they are sorted.
Mode: the most frequent value of a certain feature in a given data set.

Need for data Pre-processing
“No quality data, less accurate results!”
“The less accurate the model, the higher the probability of a wrong decision”
“Right decisions require more quality data”

Data Cleaning (Missing values)
Missing data handling techniques
• Removing the data entity
• Manually filling in the values
• Replacing the missing value by a central tendency (mean, median, mode) of the feature vector
• Replacing the missing value by a central tendency of the feature vector computed within the same class

“Technique selection is specific to user’s preference, dataset or feature type or problem set”
“A feature is an individual measurable property or characteristic of a phenomenon being observed”

Data Cleaning (Missing values)
Replacing with mean value (applied to the sample forest fires dataset):
Month  FFMC  DC     temp  RH    wind
mar    86.2  94.3   8.2   51    6.7
oct    90.6  669.1  18    33    0.9
oct    90.6  686.9  15.3  33    3.57
mar    90.5  77.5   8.3   97    4
mar    89.3  102.2  11.4  99    1.8
aug    92.3  458.3  22.2  58.7  3.57
aug    90.5  495.6  24.1  27    3.57
aug    91.5  608.2  8     86    2.2
sep    91    692.6  15.3  63    5.4
sep    92.5  698.6  22.8  40    4
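
Central-tendency replacement can be sketched with Python's standard `statistics` module. The helper name `impute` and its `strategy` parameter are assumptions made for this sketch, not part of the slides; the two columns are FFMC and wind from the sample forest fires dataset, with `math.nan` marking missing entries.

```python
import math
import statistics

NAN = math.nan


def impute(values, strategy="mean"):
    """Replace NaN entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if math.isnan(v) else v for v in values]


# FFMC and wind columns of the sample dataset (row order as in the table).
ffmc = [86.2, 90.6, 90.6, NAN, 89.3, 92.3, NAN, 91.5, 91.0, 92.5]
wind = [6.7, 0.9, NAN, 4.0, 1.8, NAN, NAN, 2.2, 5.4, 4.0]

ffmc_mean = impute(ffmc, "mean")
ffmc_median = impute(ffmc, "median")
ffmc_mode = impute(ffmc, "mode")
wind_mean = impute(wind, "mean")

print(round(ffmc_mean[3], 2), round(ffmc_median[3], 2), ffmc_mode[3])
print(round(wind_mean[2], 2))
```

The same function covers all three central tendencies, so the choice of technique reduces to one argument, in line with the remark that technique selection depends on the dataset and feature type.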

What is data Pre-processing
Data Pre-processing: it is the phase of any Machine Learning process that transforms, or encodes, the data to bring it to a state where it can be easily interpreted by the learning algorithm.

“Data pre-processing is not a single standalone entity but a collection of multiple interrelated tasks”

“Collectively, data pre-processing constitutes the majority of the effort in the machine learning process (approx. 90 %)”

Data Cleaning (Missing values)
Sample dataset related to forest fires
Month  FFMC  DC     temp  RH   wind
mar    86.2  94.3   8.2   51   6.7
oct    90.6  669.1  18    33   0.9
oct    90.6  686.9  NaN   33   NaN
mar    NaN   77.5   8.3   97   4
mar    89.3  102.2  11.4  99   1.8
aug    92.3  NaN    22.2  NaN  NaN
aug    NaN   495.6  24.1  27   NaN
aug    91.5  608.2  8     86   2.2
sep    91    692.6  NaN   63   5.4
sep    92.5  698.6  22.8  40   4

Data Cleaning (Missing values)
Now the problem of missing values is solved!!

“Replacing by mean value: Not a suitable method if the data set has many outliers”

For example: weights of humans
67, 78, 900, -56, 389, -1 etc. (900, -56, 389 and -1 are outliers)
Mean is 229.5
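
A quick check of the weights example shows how strongly the outliers pull the mean, and how the median is affected far less. This is a small sketch with the standard `statistics` module; the variable names are chosen for illustration only.

```python
import statistics

# Weights from the slide; 900, -56, 389 and -1 are implausible for humans.
weights = [67, 78, 900, -56, 389, -1]

mean_weight = statistics.mean(weights)      # dominated by the outliers
median_weight = statistics.median(weights)  # middle of the sorted values

print(mean_weight, median_weight)
```

The median stays close to the two plausible weights (67 and 78), which is one reason it is often preferred over the mean when a feature contains outliers.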
Data Cleaning (Missing values)
Replacing with median value (applied to the sample forest fires dataset):
Month  FFMC  DC     temp  RH   wind
mar    86.2  94.3   8.2   51   6.7
oct    90.6  669.1  18    33   0.9
oct    90.6  686.9  14.7  33   4
mar    90.8  77.5   8.3   97   4
mar    89.3  102.2  11.4  99   1.8
aug    92.3  608.2  22.2  51   4
aug    90.8  495.6  24.1  27   4
aug    91.5  608.2  8     86   2.2
sep    91    692.6  14.7  63   5.4
sep    92.5  698.6  22.8  40   4

Data Cleaning (Noisy data)
Binning method: performs the task of data smoothing.
Steps to be followed under the binning method are:
Step 1: Sort the data into ascending order.
Step 2: Calculate the bin size (number of elements per bin).
Step 3: Partition or distribute the data equally among the bins, starting with the first element of the sorted data.
Step 4: Perform data smoothing using bin means, bin boundaries, or bin medians.
The last bin can have one element fewer or more!!

Data Cleaning (Noisy data)
Outlier analysis: performs the task of data refinement by tracing down the outliers with the help of clustering and dealing with them.

Data Cleaning (Missing values)
Replacing with mode value (applied to the sample forest fires dataset; DC and temp have no repeated value, so their gaps remain):
Month  FFMC  DC     temp  RH   wind
mar    86.2  94.3   8.2   51   6.7
oct    90.6  669.1  18    33   0.9
oct    90.6  686.9  NaN   33   4
mar    90.6  77.5   8.3   97   4
mar    89.3  102.2  11.4  99   1.8
aug    92.3  NaN    22.2  33   4
aug    90.6  495.6  24.1  27   4
aug    91.5  608.2  8     86   2.2
sep    91    692.6  NaN   63   5.4
sep    92.5  698.6  22.8  40   4

“Mode is a good option for missing values in case of categorical variables”

Data Cleaning (Noisy data)
Example: 9, 21, 29, 28, 4, 21, 8, 24, 26
Step 1: Sort the data: 4, 8, 9, 21, 21, 24, 26, 28, 29
Step 2: Bin size calculation
$\text{Bin size} = \frac{\max - \min}{\text{number of elements}} = \frac{29 - 4}{9} = 2.777$
But we need to take the ceiling value, so the bin size is 3 here.

Data Cleaning (Inconsistent data)
Inconsistent Data: discrepancies between different data items.
e.g. the “Address” field contains the “Phone number”

To resolve inconsistencies
• Manual correction using external references
• Semi-automatic tools
  • To detect violation of known functional dependencies and data constraints
  • To correct redundant data

Data Cleaning (Missing values)
Replacing with central tendency values corresponding to the class for a certain feature.
The technique is similar to the one discussed above.
Exception:
• The feature vector is divided into a number of subparts, each corresponding to one class.
• Each sub feature vector is used to determine the central tendency values that replace missing values of that feature for that particular class.
• e.g. an employee salary vector can be divided into male and female, or by job designation, etc.

Data Cleaning (Noisy data)
Step 3: Bin partitioning (equi-size)
Bin 1: 4, 8, 9
Bin 2: 21, 21, 24
Bin 3: 26, 28, 29
Step 4: Data smoothing
• Using mean value: replace the bin values by the bin average
Bin 1: 7, 7, 7
Bin 2: 22, 22, 22
Bin 3: 27.67, 27.67, 27.67

Data Cleaning (Inconsistent data)
“To avoid inconsistencies, perform data assessment like knowing what the data type of the features should be and whether it is the same for all the data objects.”
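
The binning steps can be sketched as follows for the example values 9, 21, 29, 28, 4, 21, 8, 24, 26. The function names are assumptions made for this sketch, the bin size uses the rule implied by the slide's 2.777 (range divided by the number of elements, rounded up), and ties in boundary smoothing go to the lower boundary, which is also an assumption.

```python
import math
import statistics


def make_bins(values, bin_size):
    """Steps 1 and 3: sort the data, then cut it into consecutive bins."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_mean(bins):
    """Replace every value in a bin by the bin mean."""
    return [[statistics.mean(b)] * len(b) for b in bins]


def smooth_by_median(bins):
    """Replace every value in a bin by the bin median."""
    return [[statistics.median(b)] * len(b) for b in bins]


def smooth_by_boundary(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


data = [9, 21, 29, 28, 4, 21, 8, 24, 26]

# Step 2: bin size = ceil((max - min) / number of elements) = ceil(25 / 9)
bin_size = math.ceil((max(data) - min(data)) / len(data))
bins = make_bins(data, bin_size)

print(bins)
print(smooth_by_mean(bins))
print(smooth_by_boundary(bins))
print(smooth_by_median(bins))
```

Note that the exact mean of the third bin is 83/3 (about 27.67), so a slide value of 27 for that bin is a truncation.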

Data Cleaning (Noisy data)
Noise is defined as a random variance in a measured variable. For numeric values, boxplots and scatter plots can be used to identify outliers.
[Figures: boxplot and scatter plot]

Data Cleaning (Noisy data)
• Using boundary values: replace each bin value by the closest boundary value of the corresponding bin.
Bin 1: 4, 9, 9
Bin 2: 21, 21, 24
Bin 3: 26, 29, 29
“Boundary values remain unchanged in boundary method”
• Using median values: replace each bin value by the bin median.
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28

Data Integration
Data Integration: It is the process of merging the data from multiple sources into a coherent data store.
e.g. Collection of banking data from different banks at data stores of RBI

Issues in data integration
• Schema integration and feature matching
• Redundant features
• Detection and resolution of data value conflicts.

Data Cleaning (Noisy data)
Common reasons for random variations are:
• Malfunctioning of collection instruments.
• Data entry lags.
• Data transmission problems

To deal with these anomalous values, data smoothing techniques are applied. Some popular ones are:
• Binning method
• Regression
• Outlier analysis

Data Cleaning (Noisy data)
Regression method: Linear regression and multiple linear regression can be used to smooth the data, where the values are conformed to a function.

Data Integration
Schema integration and feature matching:
Source tables: (Cust.id, Name, Age, DoB), (Cust.no, Name, Age, DoB), (Cust.id, Name, Age, DoB)
Integrated table: (Cust.id, Name, Age, DoB)
“Carefully analyse the metadata”
Data Integration
Redundant features: They are unwanted features.
• To deal with redundant features, correlation analysis is performed. The correlation coefficient is denoted by r.
[Figure: scatter plots for r positive, r negative, and r zero]
Example table: (Cust.id, Name, Age, DoB)
“Challenging task”

Data Reduction

Data Reduction (Attribute subset)
Statistical technique: chi-square test
• A chi-square test is used to test the independence of a predictor and the target.
• In feature reduction, the aim is to select the features that are highly dependent on the target.
• Independent features are removed.
“The higher the value of chi-square, the higher the dependence level.”
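
As a rough illustration of the chi-square test for feature selection, the statistic can be computed from a contingency table of a categorical feature against the target. The function name and both tables below are hypothetical, made up for this sketch; a complete test would also compare the statistic with a critical value for the table's degrees of freedom.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected under independence of feature and target.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are feature values, columns are target classes.
dependent_feature = [[30, 10], [10, 30]]
independent_feature = [[20, 20], [20, 20]]

print(chi_square(dependent_feature))    # large value: feature tracks the target
print(chi_square(independent_feature))  # zero: feature carries no information
```

Under this scheme the feature with the larger statistic would be kept and the independent one removed.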

Data Integration
Detection and resolution of data value conflicts:
Source tables: (Product, Year, Price ($)), (Product, Year, Price (Rs)), (Product, Year, Price (pound))
Integrated table, research data for price of essential products: (Product, Year, Price)
“Carefully analyse the metadata”

Data Reduction
Benefits of data reduction
• Accuracy improvements.
• Reduced risk of overfitting (the model fits the training data, but not the validation data).
• Speed up in training.
• Improved data visualization.
• Increase in explainability of our model.
• Increased storage efficiency.
• Reduced storage cost.

Data Reduction
Low variance filter: Normalized features that have variance (distribution) less than a threshold are also removed, since little change in the data means less information.

F1     F2     F3
0.113  0.272  0.318
0.803  0.383  0.027
0.197  0.630  0.319
0.210  0.987  0.310
0.305  0.984  0.390
0.464  0.031  0.314
0.954  0.008  0.399
0.943  0.324  0.278
0.997  0.418  0.133
0.270  0.525  0.373

Variance (F1) = 0.115298
Variance (F2) = 0.103589
Variance (F3) = 0.012525 (very low variance)
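
The low variance filter can be sketched on the F1, F2, F3 table with the standard `statistics` module. The threshold of 0.05 and the function name are assumptions chosen for illustration; population variance is used because it reproduces the variances quoted on the slide to about three decimal places.

```python
import statistics

# Normalized feature columns from the slide.
features = {
    "F1": [0.113, 0.803, 0.197, 0.210, 0.305, 0.464, 0.954, 0.943, 0.997, 0.270],
    "F2": [0.272, 0.383, 0.630, 0.987, 0.984, 0.031, 0.008, 0.324, 0.418, 0.525],
    "F3": [0.318, 0.027, 0.319, 0.310, 0.390, 0.314, 0.399, 0.278, 0.133, 0.373],
}


def low_variance_filter(columns, threshold):
    """Return the variance of each column and the names that pass the filter."""
    variances = {name: statistics.pvariance(vals) for name, vals in columns.items()}
    kept = [name for name, var in variances.items() if var >= threshold]
    return variances, kept


# Hypothetical threshold; in practice it is tuned per dataset.
variances, kept = low_variance_filter(features, threshold=0.05)

for name, var in variances.items():
    print(name, round(var, 4))
print("kept:", kept)
```

With this threshold F3 is dropped, matching the slide's "very low variance" annotation.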

Data Transformation
This step is done before Data Reduction, but its details will be discussed after Data Reduction (Data Reduction needs 50 consecutive min)

Data Reduction
Major techniques of data reduction are:
• Attribute subset selection
• Low variance filter
• High correlation filter
• Numerosity reduction
• Dimensionality reduction

Data Reduction
High correlation filter: Normalized attributes that have a correlation coefficient above a threshold are also removed, since similar trends mean similar information is carried. The correlation is usually calculated using statistical methods such as Pearson’s correlation coefficient, Pearson’s chi-square value, etc.

Data Reduction
Attribute subset selection: The highly relevant attributes should be used; all the rest can be discarded.

Techniques of attribute selection:
• Brute force approach: Try all possible feature subsets as input to the machine learning algorithm and analyze the results.
• Statistical approach: The concept of statistical testing is applied for selecting the most significant features from the original feature set.

Data Reduction
Numerosity Reduction: This enables storing a model of the data instead of the whole data.
e.g. Regression Models.

Data Reduction
Data Reduction: It is a process of constructing a condensed representation of the data set which is smaller in volume, while maintaining the integrity of the original one. The efficiency of results should not degrade with data reduction.

Some facts about data reduction in machine learning:
• There exists an optimal number of features in a feature set for the corresponding Machine Learning task.
• Adding features beyond the optimal (strictly necessary) ones results in a performance degradation (because of added noise).

Data Reduction (Attribute subset)
Brute force technique:
Feature set: F1 F2 F3 F4 F5 F6 F7 F8 F9, Target
Step 1: Construct the power set corresponding to the feature set.
Step 2: Select an element from the power set.
Step 3: Measure the accuracy of the learning model corresponding to the selected features.
Step 4: Repeat Step 2 and Step 3 until the desired accuracy is achieved.
“To decrease the number of iterations, expert knowledge of the features is used”

Data Reduction
Dimensionality Reduction: It is the technique to reduce the number of dimensions, not simply by selecting a feature subset but by doing much more than that.
• Dimensions refers to the number of geometric planes the dataset lies in, which can be so high that the dataset cannot be visualized with pen and paper.
“The more planes, the greater the complexity of the dataset”
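
The brute force steps for attribute subset selection can be sketched with `itertools.combinations`, which enumerates the non-empty elements of the power set. The scoring function below is a purely hypothetical stand-in for "train the model and measure its accuracy"; the gains and the per-feature penalty are invented for this sketch.

```python
from itertools import combinations


def all_subsets(features):
    """Step 1: every non-empty subset of the feature set (the power set)."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def brute_force_select(features, score):
    """Steps 2 to 4: score every subset and return the best one found."""
    return max(all_subsets(features), key=score)


# Hypothetical usefulness of each feature; F2 adds mostly noise.
GAIN = {"F1": 0.5, "F2": 0.05, "F3": 0.3}


def toy_score(subset):
    """Stand-in for model accuracy: total gain minus a cost per feature."""
    return sum(GAIN[f] for f in subset) - 0.1 * len(subset)


subsets = list(all_subsets(list(GAIN)))
best = brute_force_select(list(GAIN), toy_score)

print(len(subsets), "subsets evaluated; best:", best)
```

The number of subsets grows as 2^n − 1, so with the nine features on the slide there are already 511 candidates, which is why expert knowledge is used to prune the search.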
Data Reduction
Principal Component Analysis (PCA): It is a technique of dimensionality reduction which performs the said task by reducing a higher-dimensional feature space to a lower-dimensional feature space. It also helps to make visualization of the dataset simple.

Eigenvector: the direction of a line, while the eigenvalue is a number that tells us how the data set is spread out on the line which is an eigenvector. Eigenvectors show us the direction of the main axes (principal components) of our data. The greater the eigenvalue, the greater the variation along this axis.

Data Reduction (PCA)
Stepwise working of PCA
Step 1: Construction of the covariance matrix, i.e. A.
Step 2: Computation of eigenvalues for the covariance matrix.
Step 3: Compute the eigenvector corresponding to every eigenvalue obtained in Step 2.

Data Reduction (PCA)
Example: Compute eigenvectors
$A - \lambda I = \begin{pmatrix} 0.6165 - \lambda & 0.6154 \\ 0.6154 & 0.7165 - \lambda \end{pmatrix}$

For $\lambda_1$: $\begin{pmatrix} -0.6675 & 0.6154 \\ 0.6154 & -0.5675 \end{pmatrix}$

For $\lambda_2$: $\begin{pmatrix} 0.5674 & 0.6154 \\ 0.6154 & 0.6674 \end{pmatrix}$

Data Reduction (PCA)
Some of the major facts about PCA are:
• Principal components are new features that are constructed as linear combinations or mixtures of the initial feature set.
• These combinations are performed in such a manner that all the newly constructed principal components are uncorrelated.
• Together with the reduction task, PCA also preserves as much information as possible from the original data set.

Data Reduction (PCA)
Stepwise working of PCA
Step 4: Sort the eigenvectors in decreasing order of eigenvalues and choose the k eigenvectors with the largest eigenvalues.
Step 5: Transform the data along the principal component axes.
“Seems Difficult !!”
No
“Just need some mathematical skills”

PCA: Eigenvector compute
$\begin{pmatrix} -0.6675 & 0.6154 \\ 0.6154 & -0.5675 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 0$
First row: $-0.6675\,x_1 + 0.6154\,x_2 = 0$
Dividing by $(0.6675 \times 0.6154)$ throughout: $x_1 / 0.6154 = x_2 / 0.6675 = t$ (say)
$V_1 = (0.6154,\ 0.6675)$
Unit eigenvector: divide by $\sqrt{0.6154^2 + 0.6675^2} = \sqrt{0.8243} = 0.908$
$V_1 = \frac{1}{0.908}(0.6154,\ 0.6675) \approx (0.678,\ 0.735)$

Data Reduction (PCA)
Some of the major facts about PCA are:
• Principal components are usually denoted by PCi, where i can be 0, 1, 2, 3, ..., n (depending on the number of features in the original data set).
• The major proportion of information about the original feature set can be explained by the first principal component alone, i.e. PC1.
• The remaining information can be obtained from the other principal components, in decreasing proportion as the value of i increases.

Data Reduction (PCA)
Example: Original dataset
X    Y
2.5  2.4
0.5  0.7
2.2  2.9
1.9  2.2
3.1  3
2.3  2.7
2    1.6
1    1.1
1.5  1.6
1.1  0.9

Compute covariance matrix A:
$A = \begin{pmatrix} Cov(X,X) & Cov(X,Y) \\ Cov(Y,X) & Cov(Y,Y) \end{pmatrix}$

Data Reduction (PCA)
Example: Eigenvectors
$\begin{pmatrix} 0.67787 & -0.73517 \\ 0.73517 & 0.67787 \end{pmatrix}$
The first column is the First Principal Component (PC1); the second column is the Second Principal Component (PC2).
“The vector corresponding to the highest eigenvalue is considered as PC1, followed by the other components as per their eigenvalues.”
• To calculate the percentage of information explained by PC1 and PC2, divide each eigenvalue by the sum of the eigenvalues (1.284 + 0.049 = 1.333; recall $\lambda_1 = 1.284$, $\lambda_2 = 0.049$).
PC1 = 1.284/1.333 = 96%    PC2 = 0.049/1.333 = 4%

Data Reduction (PCA)

Data Reduction (PCA)
Example:
X    Y    X−X̄    Y−Ȳ    (X−X̄)²  (Y−Ȳ)²  (X−X̄)(Y−Ȳ)
2.5  2.4  0.69   0.49   0.4761  0.2401  0.3381
0.5  0.7  -1.31  -1.21  1.7161  1.4641  1.5851
2.2  2.9  0.39   0.99   0.1521  0.9801  0.3861
1.9  2.2  0.09   0.29   0.0081  0.0841  0.0261
3.1  3    1.29   1.09   1.6641  1.1881  1.4061
2.3  2.7  0.49   0.79   0.2401  0.6241  0.3871
2    1.6  0.19   -0.31  0.0361  0.0961  -0.0589
1    1.1  -0.81  -0.81  0.6561  0.6561  0.6561
1.5  1.6  -0.31  -0.31  0.0961  0.0961  0.0961
1.1  0.9  -0.71  -1.01  0.5041  1.0201  0.7171

X̄ = 1.81, Ȳ = 1.91
Cov(X,X) = 0.6165    Cov(Y,Y) = 0.7165    Cov(X,Y) = Cov(Y,X) = 0.6154

Data Reduction (PCA)
Step 4 helps to reduce the dimension by discarding the components that carry a very small percentage of the information in a multi-dimensional space. The remaining ones form a matrix of vectors known as the feature vector. Each column corresponds to one principal component.

Step 5: data transformation along the principal components using
PCA_dataset = EigenVector^T * Mean-Adjusted-Data^T
Mean-Adjusted-Data => each point of the Original Data is adjusted by subtracting the respective means

Extra Step (Recover Orig Data from PCA-Data)
Mean-Adjusted-Data^T = (EigenVector^T)^-1 * PCA_dataset
                     = EigenVector * PCA_dataset   (the unit eigenvectors are orthonormal, so the inverse of EigenVector^T is EigenVector)
Original Data => add the means to the respective dimensional values of the Mean-Adjusted-Data

Data Reduction (PCA)
• Geometrically, it can be said that principal components are lines pointing in the directions that capture the maximum amount of information about the data.
Simply, principal components are new axes that give better data visibility, with clear differences between observations.

Data Reduction (PCA)
Example: Compute eigenvalues
$A - \lambda I = \begin{pmatrix} 0.6165 & 0.6154 \\ 0.6154 & 0.7165 \end{pmatrix} - \lambda \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 0.6165 - \lambda & 0.6154 \\ 0.6154 & 0.7165 - \lambda \end{pmatrix}$
Find the determinant, equate it to zero, and solve for $\lambda_1$ and $\lambda_2$.
How: $(0.6165 - \lambda)(0.7165 - \lambda) - (0.6154 \times 0.6154) = 0$
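
The PCA example can be checked numerically with plain Python. This is a minimal sketch for the two-feature case only: the eigenvalues come from the quadratic formula applied to the characteristic equation above, and all variable and function names are chosen for illustration. The sample covariance (division by n − 1) is used because it reproduces the covariance values on the slides.

```python
import math

# Original dataset from the PCA example.
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]


def sample_cov(a, b):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)


# Step 1: covariance matrix A = [[cxx, cxy], [cxy, cyy]]
cxx, cyy, cxy = sample_cov(xs, xs), sample_cov(ys, ys), sample_cov(xs, ys)

# Step 2: eigenvalues are the roots of lambda^2 - trace*lambda + det = 0
trace = cxx + cyy
det = cxx * cyy - cxy * cxy
root = math.sqrt(trace * trace - 4 * det)
lam1, lam2 = (trace + root) / 2, (trace - root) / 2

# Step 3: from (cxx - lam1)*x1 + cxy*x2 = 0, one solution is (cxy, lam1 - cxx)
v1 = (cxy, lam1 - cxx)
norm = math.hypot(*v1)
pc1 = (v1[0] / norm, v1[1] / norm)

# Share of the total variance explained by the first principal component
explained_pc1 = lam1 / (lam1 + lam2)

print(round(cxx, 4), round(cxy, 4), round(cyy, 4))
print(round(lam1, 3), round(lam2, 3))
print(round(pc1[0], 4), round(pc1[1], 4), round(explained_pc1, 2))
```

For more than two features the characteristic equation no longer has a simple closed form, and a numerical eigen-decomposition routine would normally be used instead.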
Data Transformation (usually done before Dimensionality Reduction)
Data Transformation: It is a process to transform or consolidate the data into a form suitable for machine learning algorithms.

Major techniques of data transformation are:
• Normalization
• Aggregation
• Generalization
• Feature construction

Data Transformation (Normalization)
Decimal scaling method:
$x' = \frac{x}{10^{j}}$
Where,
$x'$ is the mapped value
$x$ is the data value to be mapped into the specific range
$j$ is the maximum of the count of digits in the minimum and maximum value of the feature vector corresponding to $x$

X    Decimal scaling normalization
2    0.02
47   0.47
90   0.9
18   0.18
5    0.05

Feature Scaling
• A large difference in the magnitude of features makes the process of model training difficult.
• Many models need to calculate the Euclidean distance or Manhattan distance, which comes out to be large when the magnitudes differ hugely.
• Moreover, visualization is also difficult as the data points are located very far from each other.
• KNN and K-means are some popularly known algorithms where feature scaling plays an important role.

Data Transformation
Normalization: It is the technique of mapping the numerical feature values of any range into a specific smaller range, i.e. 0 to 1 or -1 to 1.

Popular methods of Normalization are:
• Min-Max method
• Mean normalization
• Z score method
• Decimal scaling method

Data Transformation
Aggregation: take the aggregated values in order to put the data in a better perspective.
e.g. in case of transactional data, the day to day sales of a product at various store locations can be aggregated store wise over months or years in order to be analyzed for some decision making.

Benefits of aggregation
• Reduces the memory consumed to store large data records.
• Provides a more stable view point than individual data objects.

Feature Scaling

Data Transformation (Aggregation)

Data Transformation (Normalization)
Min-Max method:
$x' = \frac{x - \min(X)}{\max(X) - \min(X)}$
Where,
$x'$ is the mapped value
$x$ is the data value to be mapped into the specific range
$\min(X)$ and $\max(X)$ are the minimum and maximum value of the feature vector corresponding to $x$.

X    Min-max normalization
2    0
47   0.511
90   1
18   0.18
5    0.034

Feature Scaling
Feature Scaling Techniques
• Normalization (discussed under data transformation)
• Standardization (a.k.a. Z-score normalization)
  • It standardizes the data to follow a standard normal distribution (mean 0 and standard deviation 1).
[Figure: Standard Normal Distribution. Image source: Towards Data Science blog]
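
The four normalization methods can be compared side by side on the slide's feature vector X = 2, 47, 90, 18, 5. The function names below are assumptions for this sketch; the z-score version uses the sample standard deviation (`statistics.stdev`), since that is what reproduces the values on the Z score slide.

```python
import statistics


def min_max(xs):
    """Map values to [0, 1] using the minimum and maximum."""
    low, high = min(xs), max(xs)
    return [(x - low) / (high - low) for x in xs]


def mean_normalize(xs):
    """Center on the mean, then scale by the range."""
    mu, low, high = statistics.mean(xs), min(xs), max(xs)
    return [(x - mu) / (high - low) for x in xs]


def z_score(xs):
    """Center on the mean, then scale by the sample standard deviation."""
    mu, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]


def decimal_scale(xs):
    """Divide by 10^j, where j is the largest digit count among min and max."""
    j = max(len(str(abs(int(v)))) for v in (min(xs), max(xs)))
    return [x / 10 ** j for x in xs]


X = [2, 47, 90, 18, 5]

for name, fn in [("min-max", min_max), ("mean", mean_normalize),
                 ("z-score", z_score), ("decimal", decimal_scale)]:
    print(name, [round(v, 3) for v in fn(X)])
```

Min-max and decimal scaling keep the values within [0, 1] for this non-negative vector, while mean normalization and the z-score produce values centered on zero.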

Data Transformation (Normalization)
Mean normalization
$x' = \frac{x - \bar{x}}{\max(X) - \min(X)}$
Where,
$x'$ is the mapped value
$x$ is the data value to be mapped into the specific range
$\bar{x}$ is the mean of the feature vector corresponding to $x$.
$\min(X)$ and $\max(X)$ are the minimum and maximum value of the feature vector corresponding to $x$.

X    Mean normalization
2    -0.345
47   0.166
90   0.655
18   -0.164
5    -0.311

Data Transformation
Generalization: The data is generalized from low-level to higher order concepts using concept hierarchies.
e.g. categorical attributes like street can be generalized to higher order concepts like city or country.
“The decision of generalization level depends on the problem statement”

Feature construction: New attributes are constructed from the given set of attributes.
e.g. features like mobile number and landline number can be combined under a new feature, contact number.

Dataset Splitting
“After performing various pre-processing tasks on the dataset, it is now ready to be utilized by the existing machine learning algorithms”
“Wait !! Before modeling the dataset for analysis and decision making by machine learning algorithms, it is advisable to split it into training, validation, and testing sets.”

Data Transformation (Normalization)
Z Score method:
$x' = \frac{x - \mu}{\sigma}$
Where,
$x'$ is the mapped value
$x$ is the data value to be mapped into the specific range
$\mu$ and $\sigma$ are the mean and standard deviation of the feature vector corresponding to $x$.

X    Z score normalization
2    -0.826
47   0.397
90   1.566
18   -0.391
5    -0.745

Feature Scaling
Feature Scaling: It is a method used to normalize the range of independent features of data. Basically, it is the process of scaling down the feature magnitudes to bring them onto a common range.

Need for feature scaling
A feature has a magnitude and units.
Tax = func (Salary, Age)

Salary   Age   tax
40000    35
500000   42
25000    26
30000    30

Dataset Splitting
Training data: This is the part on which machine learning algorithms are actually trained to build a model. The model tries to learn the dataset and its various characteristics.
Validation data: This is the part of the dataset which is used to validate our various model fits. In simpler words, it is used to tune the hyperparameters of the algorithm for better performance.
“Validation!! only tuning No learning”
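
A three-way split by simple random sampling can be sketched with the standard `random` module. The function name, the fixed seed, and the default 70/15/15 proportions are choices made for this illustration; the records are placeholder integers rather than real data.

```python
import random


def split_dataset(records, train=0.70, val=0.15, seed=0):
    """Shuffle the records, then cut them into train, validation, and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # every record equally likely anywhere
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    train_part = shuffled[:n_train]
    val_part = shuffled[n_train:n_train + n_val]
    test_part = shuffled[n_train + n_val:]   # whatever remains
    return train_part, val_part, test_part


data = list(range(100))  # placeholder records
train_set, val_set, test_set = split_dataset(data)

print(len(train_set), len(val_set), len(test_set))
```

Because the shuffle ignores class labels, a small or imbalanced dataset can end up with a skewed class mix in one of the parts, which is the situation stratified sampling is meant to avoid.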
Chapter 2

Dataset Splitting Dataset Splitting Background Mathematics

Test data : This part of the dataset is used to test our Commonly used split ratios are: This section will attempt to give some elementary background mathematical skills that
will be required to understand the process of Principal Components Analysis. The
topics are covered independently of each other, and examples given. It is less important

model and quantify the accuracy measures to depict its


to remember the exact mechanics of a mathematical technique than it is to understand

70% train, 15% val, 15% test.


the reason why such a technique may be used, and what the result of the operation tells
us about our data. Not all of these techniques are used in PCA, but the ones that are not

performance when deployed on real-world data.


explicitly required do provide the grounding on which the most important techniques
are based.
I have included a section on Statistics which looks at distribution measurements,

80% train, 10% val, 10% test.


or, how the data is spread out. The other section is on Matrix Algebra and looks at
eigenvectors and eigenvalues, important properties of matrices that are fundamental to
PCA.

Data splitting techniques: 60% train, 20% val, 20% test.


2.1 Statistics
The entire subject of statistics is based around the idea that you have this big set of data,
and you want to analyse that set in terms of the relationships between the individual
points in that data set. I am going to look at a few of the measures you can do on a set

• Simple random sampling (SRS) Many people split into 2 sets, instead of 3 sets:
of data, and what they tell you about the data itself.

2.1.1 Standard Deviation

• Systematic sampling
To understand standard deviation, we need a data set. Statisticians are usually con-

(i) training (ii) Validation/Testing


cerned with taking a sample of a population. To use election polls as an example, the
population is all the people in the country, whereas a sample is a subset of the pop-
ulation that the statisticians measure. The great thing about statistics is that by only

• Stratified sampling
measuring (in this case by doing a phone survey or similar) a sample of the population,
you can work out what is most likely to be the measurement if you used the entire pop-

Common Ratios: 70, 30 and 80,20


ulation. In this statistics section, I am going to assume that our data sets are samples

of some bigger population. There is a reference later in this section pointing to more
information about samples and populations.

Dataset Splitting

Simple random sampling: In a simple random sample each observation in the data set has an equal chance of being selected.
“Can result in bias toward one category if the categories are unevenly distributed.”

Systematic sampling: It involves selecting items from an ordered population using a skip or sampling interval. That means that every "nth" data sample is chosen in a large data set.
“Not a good choice when the data exhibits some patterns.”

Department of Computer Science, University of Otago
Technical Report OUCS-2002-12
A tutorial on Principal Components Analysis
Author: Lindsay I Smith
Department of Computer Science, University of Otago, New Zealand

Department of Computer Science, University of Otago, PO Box 56, Dunedin, Otago, New Zealand
http://www.cs.otago.ac.nz/research/techreports.php

Here’s an example set:

$$X = [1 \; 2 \; 4 \; 6 \; 12 \; 15 \; 25 \; 45 \; 68 \; 67 \; 65 \; 98]$$

I could simply use the symbol $X$ to refer to this entire set of numbers. If I want to refer to an individual number in this data set, I will use subscripts on the symbol $X$ to indicate a specific number. Eg. $X_3$ refers to the 3rd number in $X$, namely the number 4. Note that $X_1$ is the first number in the sequence, not $X_0$ like you may see in some textbooks. Also, the symbol $n$ will be used to refer to the number of elements in the set $X$.
There are a number of things that we can calculate about a data set. For example, we can calculate the mean of the sample. I assume that the reader understands what the mean of a sample is, and will only give the formula:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Notice the symbol $\bar{X}$ (said “X bar”) to indicate the mean of the set $X$. All this formula says is “Add up all the numbers and then divide by how many there are”.
Unfortunately, the mean doesn’t tell us a lot about the data except for a sort of middle point. For example, these two data sets have exactly the same mean (10), but are obviously quite different:

$$[0 \; 8 \; 12 \; 20] \quad \text{and} \quad [8 \; 9 \; 11 \; 12]$$

So what is different about these two sets? It is the spread of the data that is different. The Standard Deviation (SD) of a data set is a measure of how spread out the data is.
How do we calculate it? The English definition of the SD is: “The average distance from the mean of the data set to a point”. The way to calculate it is to compute the squares of the distance from each data point to the mean of the set, add them all up, divide by $n-1$, and take the positive square root. As a formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{(n-1)}}$$

Where $s$ is the usual symbol for standard deviation of a sample. I hear you asking “Why are you using $(n-1)$ and not $n$?”. Well, the answer is a bit complicated, but in general, if your data set is a sample data set, ie. you have taken a subset of the real-world (like surveying 500 people about the election) then you must use $(n-1)$ because it turns out that this gives you an answer that is closer to the standard deviation that would result if you had used the entire population, than if you’d used $n$. If, however, you are not calculating the standard deviation for a sample, but for an entire population, then you should divide by $n$ instead of $(n-1)$. For further reading on this topic, the web page http://mathcentral.uregina.ca/RR/database/RR.09.95/weston2.html describes standard deviation in a similar way, and also provides an example experiment that shows the difference between each of the denominators. It also discusses the difference between samples and populations.
So, for our two data sets above, the calculations of standard deviation are in Table 2.1.

Set 1:

$X_i$    $(X_i - \bar{X})$    $(X_i - \bar{X})^2$
0        -10                  100
8        -2                   4
12       2                    4
20       10                   100
Total                         208
Divided by (n-1)              69.333
Square Root                   8.3266

Set 2:

$X_i$    $(X_i - \bar{X})$    $(X_i - \bar{X})^2$
8        -2                   4
9        -1                   1
11       1                    1
12       2                    4
Total                         10
Divided by (n-1)              3.333
Square Root                   1.8257

Table 2.1: Calculation of standard deviation

And so, as expected, the first set has a much larger standard deviation due to the fact that the data is much more spread out from the mean. Just as another example, the data set:

$$[10 \; 10 \; 10 \; 10]$$

also has a mean of 10, but its standard deviation is 0, because all the numbers are the same. None of them deviate from the mean.

2.1.2 Variance
Variance is another measure of the spread of data in a data set. In fact it is almost identical to the standard deviation. The formula is this:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{(n-1)}$$

You will notice that this is simply the standard deviation squared, in both the symbol ($s^2$) and the formula (there is no square root in the formula for variance). $s^2$ is the usual symbol for variance of a sample. Both these measurements are measures of the spread of the data. Standard deviation is the most common measure, but variance is also used. The reason why I have introduced variance in addition to standard deviation is to provide a solid platform from which the next section, covariance, can launch.
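The mean, standard deviation and variance formulas can be checked against Table 2.1 with a short script. This is a sketch using only the Python standard library; the helper names `mean`, `sample_variance` and `sample_std` are chosen here for illustration, and the results should agree with the table up to rounding in the last digit.

```python
import math

def mean(xs):
    """Arithmetic mean: add up all the numbers, divide by how many there are."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sum of squared distances from the mean, divided by (n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_std(xs):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

set1 = [0, 8, 12, 20]
set2 = [8, 9, 11, 12]
flat = [10, 10, 10, 10]

for name, data in (("Set 1", set1), ("Set 2", set2), ("Constant", flat)):
    print(name, "mean:", mean(data),
          "variance:", round(sample_variance(data), 3),
          "std:", round(sample_std(data), 4))
```

Dividing by `len(xs)` instead of `len(xs) - 1` would give the population versions of the same measures.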

Exercises
Find the mean, standard deviation, and variance for each of these data sets.
• [12 23 34 44 59 70 98]
• [12 15 25 27 32 88 99]
• [15 35 78 82 90 95 97]

A tutorial on Principal Components Analysis
Lindsay I Smith
February 26, 2002

2.1.3 Covariance
The last two measures we have looked at are purely 1-dimensional. Data sets like this could be: heights of all the people in the room, marks for the last COMP101 exam etc. However many data sets have more than one dimension, and the aim of the statistical analysis of these data sets is usually to see if there is any relationship between the dimensions. For example, we might have as our data set both the height of all the students in a class, and the mark they received for that paper. We could then perform statistical analysis to see if the height of a student has any effect on their mark.

Stratified sampling: This is accomplished by selecting samples at random within each class. This approach ensures that the frequency distribution of the outcome is approximately equal within the training and test sets. Mostly used in classification problems or where data categories are available.
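Systematic and stratified sampling, as described on the slides, can be sketched in plain Python. The function names, the 80/20 class imbalance and the 20% test fraction below are assumptions made for this illustration only.

```python
import random
from collections import defaultdict

def systematic_sample(items, step, start=0):
    """Take every `step`-th item of an ordered sequence, beginning at `start`."""
    return list(items)[start::step]

def stratified_split(items, labels, test_fraction=0.2, seed=0):
    """Split so that each class sends about `test_fraction` of its items to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                      # random selection within the class
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# 100 dummy records: 80 of class "a" followed by 20 of class "b".
items = list(range(100))
labels = ["a"] * 80 + ["b"] * 20

every_tenth = systematic_sample(items, step=10)
train, test = stratified_split(items, labels, test_fraction=0.2)
print(every_tenth)
print(len(train), len(test))
```

Because the split is done class by class, the minority class "b" keeps the same 20% share in the test set that it has in the full data, which a plain random split does not guarantee.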
Standard deviation and variance only operate on 1 dimension, so that you could
only calculate the standard deviation for each dimension of the data set independently
of the other dimensions. However, it is useful to have a similar measure to find out how
much the dimensions vary from the mean with respect to each other.
Covariance is such a measure. Covariance is always measured between 2 dimen-
sions. If you calculate the covariance between one dimension and itself, you get the
variance. So, if you had a 3-dimensional data set ($x$, $y$, $z$), then you could measure the
covariance between the $x$ and $y$ dimensions, the $x$ and $z$ dimensions, and the $y$ and $z$
dimensions. Measuring the covariance between $x$ and $x$, or $y$ and $y$, or $z$ and $z$ would
give you the variance of the $x$, $y$ and $z$ dimensions respectively.
The formula for covariance is very similar to the formula for variance. The formula
for variance could also be written like this:

$$var(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{(n-1)}$$
where I have simply expanded the square term to show both parts. So given that knowl-
edge, here is the formula for covariance:
$$cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)}$$

It is exactly the same except that in the second set of brackets, the $X$’s are replaced by $Y$’s. This says, in English, “For each data item, multiply the difference between the $x$ value and the mean of $x$, by the difference between the $y$ value and the mean of $y$. Add all these up, and divide by $(n-1)$”.
How does this work? Let’s use some example data. Imagine we have gone into the world and collected some 2-dimensional data, say, we have asked a bunch of students how many hours in total that they spent studying COSC241, and the mark that they received. So we have two dimensions, the first is the $H$ dimension, the hours studied, and the second is the $M$ dimension, the mark received. Table 2.2 holds my imaginary data, and the calculation of $cov(H, M)$, the covariance between the Hours of study done and the Mark received.
So what does it tell us? The exact value is not as important as its sign (ie. positive or negative). If the value is positive, as it is here, then that indicates that both dimensions increase together, meaning that, in general, as the number of hours of study increased, so did the final mark.
If the value is negative, then as one dimension increases, the other decreases. If we had ended up with a negative covariance here, then that would have said the opposite, that as the number of hours of study increased the final mark decreased.
In the last case, if the covariance is zero, it indicates that the two dimensions are uncorrelated, ie. there is no linear relationship between them.
The result that mark given increases as the number of hours studied increases can be easily seen by drawing a graph of the data, as in Figure 2.1. However, the luxury of being able to visualize data is only available at 2 and 3 dimensions. Since the covariance value can be calculated between any 2 dimensions in a data set, this technique is often used to find relationships between dimensions in high-dimensional data sets where visualisation is difficult.
You might ask “is $cov(X, Y)$ equal to $cov(Y, X)$”? Well, a quick look at the formula for covariance tells us that yes, they are exactly the same since the only difference between $cov(X, Y)$ and $cov(Y, X)$ is that $(X_i - \bar{X})(Y_i - \bar{Y})$ is replaced by $(Y_i - \bar{Y})(X_i - \bar{X})$. And since multiplication is commutative, which means that it doesn’t matter which way around I multiply two numbers, I always get the same number, these two equations give the same answer.

Figure 2.1: A plot of the covariance data showing positive relationship between the number of hours studied against the mark received

Chapter 1
Introduction
This tutorial is designed to give the reader an understanding of Principal Components Analysis (PCA). PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension.
Before getting to a description of PCA, this tutorial first introduces mathematical concepts that will be used in PCA. It covers standard deviation, covariance, eigenvectors and eigenvalues. This background knowledge is meant to make the PCA section very straightforward, but can be skipped if the concepts are already familiar.
There are examples all the way through this tutorial that are meant to illustrate the concepts being discussed. If further information is required, the mathematics textbook “Elementary Linear Algebra 5e” by Howard Anton, Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6 is a good source of information regarding the mathematical background.

Split Ratio: It is the ratio in which the dataset must be split into training, validation, and testing sets. It is highly dependent on the type of model to be trained and on the dataset itself.
• larger training ratio (most usual case): If our dataset and model are such that a lot of training is required, then we use a larger chunk of the data just for training purposes, e.g. training on textual data, image data, or video data where thousands of features are involved.
• larger validation ratio: If the model has a lot of hyperparameters that can be tuned, then keeping a higher percentage of data for the validation set is advisable.

2.1.4 The covariance Matrix
Recall that covariance is always measured between 2 dimensions. If we have a data set with more than 2 dimensions, there is more than one covariance measurement that can be calculated. For example, from a 3 dimensional data set (dimensions $x$, $y$, $z$) you could calculate $cov(x, y)$, $cov(x, z)$, and $cov(y, z)$. In fact, for an $n$-dimensional data set, you can calculate $\frac{n!}{(n-2)! \times 2}$ different covariance values.
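The covariance formula and the count of dimension pairs can both be tried out in a few lines. This is a sketch using the hours/marks data of Table 2.2; the helper names `mean` and `covariance` are chosen here for illustration and are not part of the tutorial.

```python
from itertools import combinations

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance: sum of (x - mean_x)(y - mean_y), divided by (n - 1)."""
    if len(xs) != len(ys):
        raise ValueError("both dimensions need one value per data item")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hours studied and marks received, as listed in Table 2.2.
hours = [9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20]
marks = [39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80]

cov_hm = covariance(hours, marks)
print("cov(H, M) =", round(cov_hm, 2))
print("cov(H, M) == cov(M, H):", cov_hm == covariance(marks, hours))
print("cov(H, H) is the variance of H:", round(covariance(hours, hours), 2))

# Distinct pairs of dimensions in a 3-dimensional data set (x, y, z).
pairs = list(combinations(["x", "y", "z"], 2))
print(len(pairs), "pairs:", pairs)
```

Only the sign and rough size of `cov_hm` matter for the interpretation given in section 2.1.3; a positive value says that hours and marks increase together.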

          Hours(H)   Mark(M)
Data      9          39
          15         56
          25         93
          14         61
          10         50
          18         75
          0          32
          16         85
          5          42
          19         70
          16         66
          20         80
Totals    167        749
Averages  13.92      62.42

Covariance:

$H$    $M$    $(H_i - \bar{H})$    $(M_i - \bar{M})$    $(H_i - \bar{H})(M_i - \bar{M})$
9      39     -4.92                -23.42               115.23
15     56     1.08                 -6.42                -6.93
25     93     11.08                30.58                338.83
14     61     0.08                 -1.42                -0.11
10     50     -3.92                -12.42               48.69
18     75     4.08                 12.58                51.33
0      32     -13.92               -30.42               423.45
16     85     2.08                 22.58                46.97
5      42     -8.92                -20.42               182.15
19     70     5.08                 7.58                 38.51
16     66     2.08                 3.58                 7.45
20     80     6.08                 17.58                106.89
Total                                                   1352.46
Average (divided by n-1)                                122.95

Table 2.2: 2-dimensional data set and covariance calculation

A useful way to get all the possible covariance values between all the different dimensions is to calculate them all and put them in a matrix. I assume in this tutorial that you are familiar with matrices, and how they can be defined. So, the definition for the covariance matrix for a set of data with $n$ dimensions is:

$$C^{n \times n} = (c_{i,j}, \; c_{i,j} = cov(Dim_i, Dim_j)),$$

where $C^{n \times n}$ is a matrix with $n$ rows and $n$ columns, and $Dim_x$ is the $x$th dimension. All that this ugly looking formula says is that if you have an $n$-dimensional data set, then the matrix has $n$ rows and columns (so is square) and each entry in the matrix is the result of calculating the covariance between two dimensions. Eg. the entry on row 2, column 3, is the covariance value calculated between the 2nd dimension and the 3rd dimension.
An example. We’ll make up the covariance matrix for an imaginary 3 dimensional data set, using the usual dimensions $x$, $y$ and $z$. Then, the covariance matrix has 3 rows and 3 columns, and the values are this:

$$C = \begin{pmatrix} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{pmatrix}$$

Some points to note: Down the main diagonal, you see that the covariance value is between one of the dimensions and itself. These are the variances for that dimension. The other point is that since $cov(a, b) = cov(b, a)$, the matrix is symmetrical about the main diagonal.

Exercises
Work out the covariance between the $x$ and $y$ dimensions in the following 2 dimensional data set, and describe what the result indicates about the data.

Item Number:   1    2    3    4    5
x              10   39   19   23   28
y              43   13   32   21   20

Calculate the covariance matrix for this 3 dimensional set of data.

Item Number:   1    2    3
x              1    -1   4
y              2    1    3
z              1    3    -1

2.2 Matrix Algebra
This section serves to provide a background for the matrix algebra required in PCA. Specifically I will be looking at eigenvectors and eigenvalues of a given matrix. Again, I assume a basic knowledge of matrices.

Chapter 3
Principal Components Analysis
Finally we come to Principal Components Analysis (PCA). What is it? It is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. Since patterns in data can be hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analysing data.
The other main advantage of PCA is that once you have found these patterns in the data, you can compress the data, ie. by reducing the number of dimensions, without much loss of information. This technique is used in image compression, as we will see in a later section.
This chapter will take you through the steps needed to perform a Principal Components Analysis on a set of data. I am not going to describe exactly why the technique works, but I will try to provide an explanation of what is happening at each point so that you can make informed decisions when you try to use this technique yourself.

3.1 Method
Step 1: Get some data
In my simple example, I am going to use my own made-up data set. It’s only got 2 dimensions, and the reason why I have chosen this is so that I can provide plots of the data to show what the PCA analysis is doing at each step.
The data I have used is found in Figure 3.1, along with a plot of that data.

Step 2: Subtract the mean
For PCA to work properly, you have to subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. So, all the $x$ values have $\bar{x}$ (the mean of the $x$ values of all the data points) subtracted, and all the $y$ values have $\bar{y}$ subtracted from them. This produces a data set whose mean is zero.

        Data                  DataAdjust
    x       y             x        y
    2.5     2.4           .69      .49
    0.5     0.7           -1.31    -1.21
    2.2     2.9           .39      .99
    1.9     2.2           .09      .29
    3.1     3.0           1.29     1.09
    2.3     2.7           .49      .79
    2       1.6           .19      -.31
    1       1.1           -.81     -.81
    1.5     1.6           -.31     -.31
    1.1     0.9           -.71     -1.01

[Plot: Original PCA data ("./PCAdata.dat")]

Figure 3.1: PCA example data, original data on the left, data with the means subtracted on the right, and a plot of the data

easier if we take the transpose of the feature vector and the data first, rather than having a little T symbol above their names from now on. $FinalData$ is the final data set, with data items in columns, and dimensions along rows.
What will this give us? It will give us the original data solely in terms of the vectors we chose. Our original data set had two axes, $x$ and $y$, so our data was in terms of them. It is possible to express data in terms of any two axes that you like. If these axes are perpendicular, then the expression is the most efficient. This was why it was important that eigenvectors are always perpendicular to each other. We have changed our data from being in terms of the axes $x$ and $y$, and now they are in terms of our 2 eigenvectors. In the case of when the new data set has reduced dimensionality, ie. we have left some of the eigenvectors out, the new data is only in terms of the vectors that we decided to keep.
To show this on our data, I have done the final transformation with each of the possible feature vectors. I have taken the transpose of the result in each case to bring the data back to the nice table-like format. I have also plotted the final points to show how they relate to the components.
In the case of keeping both eigenvectors for the transformation, we get the data and the plot found in Figure 3.3. This plot is basically the original data, rotated so that the eigenvectors are the axes. This is understandable since we have lost no information in this decomposition.
The other transformation we can make is by taking only the eigenvector with the largest eigenvalue. The table of data resulting from that is found in Figure 3.4. As expected, it only has a single dimension. If you compare this data set with the one resulting from using both eigenvectors, you will notice that this data set is exactly the first column of the other. So, if you were to plot this data, it would be 1 dimensional, and would be points on a line in exactly the $x$ positions of the points in the plot in Figure 3.3. We have effectively thrown away the whole other axis, which is the other eigenvector.
So what have we done here? Basically we have transformed our data so that it is expressed in terms of the patterns between them, where the patterns are the lines that most closely describe the relationships between the data. This is helpful because we have now classified our data point as a combination of the contributions from each of those lines. Initially we had the simple $x$ and $y$ axes. This is fine, but the $x$ and $y$ values of each data point don’t really tell us exactly how that point relates to the rest of the data. Now, the values of the data points tell us exactly where (ie. above/below) the trend lines the data point sits. In the case of the transformation using both eigenvectors, we have simply altered the data so that it is in terms of those eigenvectors instead of the usual axes. But the single-eigenvector decomposition has removed the contribution due to the smaller eigenvector and left us with data that is only in terms of the other.

3.1.1 Getting the old data back
Wanting to get the original data back is obviously of great concern if you are using the PCA transform for data compression (an example of which we will see in the next section). This content is taken from
http://www.vision.auc.dk/ sig/Teaching/Flerdim/Current/hotelling/hotelling.html

Transformed Data =
    x               y
    -.827970186     -.175115307
    1.77758033      .142857227
    -.992197494     .384374989
    -.274210416     .130417207
    -1.67580142     -.209498461
    -.912949103     .175282444
    .0991094375     -.349824698
    1.14457216      .0464172582
    .438046137      .0177646297
    1.22382056      -.162675287

[Plot: Data transformed with 2 eigenvectors ("./doublevecfinal.dat")]

Figure 3.3: The table of data by applying the PCA analysis using both eigenvectors, and a plot of the new data points.
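The numbers in Figure 3.1 and Figure 3.3 can be reproduced with a short script. The sketch below uses only the Python standard library and, as an assumption made to keep it short, copies the two unit eigenvectors from Step 4 of the tutorial rather than computing them; all variable and function names are chosen for illustration.

```python
# Original (x, y) data from Figure 3.1.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(data)

# Step 2: subtract the mean of each dimension.
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
adjusted = [(x - mean_x, y - mean_y) for x, y in data]
xs = [x for x, _ in adjusted]
ys = [y for _, y in adjusted]

def cov(a, b):
    """Sample covariance of two mean-adjusted dimensions (their means are zero)."""
    return sum(p * q for p, q in zip(a, b)) / (n - 1)

# Step 3: the 2 x 2 covariance matrix.
cov_matrix = [[cov(xs, xs), cov(xs, ys)],
              [cov(ys, xs), cov(ys, ys)]]

# Step 4: unit eigenvectors copied from the tutorial, not computed here.
principal = (-0.677873399, -0.735178656)   # eigenvalue 1.28402771
secondary = (-0.735178656, 0.677873399)    # eigenvalue 0.0490833989

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# Final transform: express every adjusted point in terms of the eigenvectors.
transformed = [(dot(principal, p), dot(secondary, p)) for p in adjusted]

# Lossy reconstruction from the principal component only, with the mean added back.
restored = [(principal[0] * t + mean_x, principal[1] * t + mean_y)
            for t, _ in transformed]

for row in transformed:
    print(f"{row[0]: .6f} {row[1]: .6f}")
```

Keeping only the first value of each transformed pair gives the single-eigenvector data set, and `restored` shows what comes back when the second component is thrown away: every restored point lies on the line through the mean along the principal eigenvector.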

Step 3: Calculate the covariance matrix Transformed Data (Single eigenvector)


✞❡❞ ✆ ✆✒✆
✁ ❁
✞ ✆ ❢ ❞ ✎ This is done in exactly the same way as was discussed in section 2.1.4. Since the data
✞ ✞ -.827970186
is 2 dimensional, the covariance matrix will be ❢ . There are no surprises here, so I 1.77758033
✞❡❞ ❞ ✆✌✞ ❞
will just give you the result: -.992197494
✁ ✁❣✠ ☛r✆✌☛✏✎✒✎✒✎✏✎✒✎✏☛ ☛r✆✌✎✒✠✏✠✒✠✏✠✒✠✒✠ -.274210416
✞ ✆ ❢ ✞ ✑ ❢ ✞
❇✫❈❉❄ ✁ q
☛r✆✌✎✏✠✒✠✒✠✏✠✒✠✏✠
q
✔✏✆✌☛✒✎✏✎✒✎✏✎✒✎✒☛ -1.67580142
q q
-.912949103
So, since the non-diagonal elements in this covariance matrix are positive, we should .0991094375
Figure 2.2: Example of one non-eigenvector and one eigenvector
expect that both the ❁ and ❂ variable increase together. 1.14457216
.438046137
❞ ☛ 1.22382056
✞ ✁
❢ ✞ ✠ Step 4: Calculate the eigenvectors and eigenvalues of the covariance
matrix
Figure 3.4: The data after transforming using only the most significant eigenvector
✞❡❞ ☛ ✞✒✠ ☛ Since the covariance matrix is square, we can calculate the eigenvectors and eigenval-
✁ ✁❣✠
✞ ✆ ❢ ✠ ✆✌☛ ❢ ✠ ues for this matrix. These are rather important, as they tell us useful information about
our data. I will show you why soon. In the meantime, here are the eigenvectors and So, how do we get the original data back? Before we do that, remember that only if
eigenvalues: we took all the eigenvectors in our transformation will we get exactly the original data
Figure 2.3: Example of how a scaled eigenvector is still and eigenvector ✪✒✠✏✗✒✪✏✑✒❞✒❞✏✗✒✑✏✗ back. If we have reduced the number of eigenvectors in the final transformation, then
s ❭❩t s ✣ ❄ ✭✝✉✇✈ s①✴ ✁ q
✆ ✞✏✑✒✠✏✪✒✞✕✔✏✔✒✆ the retrieved data has lost some information.
q Recall that the final transform is this:
2.2.1 Eigenvectors ⑤ ❭ ✝✭ ✉ ❬ ✭ ③ ✭✻✁❦⑩ ❈❉❶ ⑤ s ✭ ③ ✈✕❅ s①⑥✼s②❇❩③✶❈ ❅ ⑩ ❈❉❶ ❬ ✭ ③ ✭✝❷ ✈ ✴❤③ ❊
✔✌❞✏✎✕✆✏✔✌✑✒☛✏✎✒☛ ☛✕✔✏✔✌✑✕✔✫❞✒❞✏✗✒✗ ✣ ❢ ✮❹❸
s ❭❴t s ✣ ❄rs②❇❩③✶❈ ❅ ✴ ✁ ✲ r q ✲ q ✔✌❞✏✎✕✆✒✔✫✑✒☛✏✎✒☛
As you know, you can multiply two matrices together, provided they are compatible ☛ ✔✒✔✫✑✕✔✌❞✏❞✒✗✏✗
sizes. Eigenvectors are a special case of this. Consider the two multiplications between q ✲ q which can be turned around so that, to get the original data back,
a matrix and a vector in Figure 2.2. ✛
It is important to notice that these eigenvectors are both unit eigenvectors ie. their ⑩ ❈❉❶ ❬ ✭ ③ ✭✝❷ ✈ ❤✴ ③ ✁❼⑩ ❈❉❶ ⑤ s ✭ ③ ✈✕❅ ①s ⑥✼s②❇❩③✶❈ ❅ ⑤ ❭ ✝✭ ✉ ❬ ✭ ③ ✭
In the first example, the resulting vector is not an integer multiple of the original
lengths are both 1. This is very important for PCA, but luckily, most maths packages,
❻✮ ❸ ❘ ❢ ✣
vector, whereas in the second example, the example is exactly 4 times the vector we ⑩ ⑤ ✭ ✈❺❅ ❅ ✛ ⑩ ⑤ ✭ ✕ ✈ ❅ ❅
began with. Why is this? Well, the vector is a vector in 2 dimensional space. The
when asked for eigenvectors, will give you unit eigenvectors. where ❈❉❶ s ③ s②⑥✻s②❇❩③✶❈
❘ is the inverse of
❈❉❶ s ③ s①⑥✼s②❇❩③✶❈ . However, when
❞ So what do they mean? If you look at the plot of the data in Figure 3.2 then you can we take all the eigenvectors in our feature vector, it turns out that the inverse of our
vector ✞ (from the second example multiplication) represents an arrow pointing see how the data has quite a strong pattern. As expected from the covariance matrix, feature vector is actually equal to the transpose of our feature vector. This is only true
✵ ✪❙❊❤✪✕✷ ✵ ❞❙❊❤✞✕✷ they two variables do indeed increase together. On top of the data I have plotted both because the elements of the matrix are all the unit eigenvectors of our data set. This
from the origin, , to the point . The other matrix, the square one, can be
the eigenvectors as well. They appear as diagonal dotted lines on the plot. As stated makes the return trip to our data easier, because the equation becomes
thought of as a transformation matrix. If you multiply this matrix on the left of a
in the eigenvector section, they are perpendicular to each other. But, more importantly,
vector, the answer is another vector that is transformed from it’s original position. ⑩ ❈❉❶ ❬ ✭ ③ ✭✝❷ ✈ ✴❤③ ✁❼⑩ ❈❉❶ ⑤ s ✭ ③ ✈✕❅ s②⑥✼s①❇▼③✶❈ ❅✒❽ ⑤ ❭ ✭✝✉ ❬ ✭ ③ ✭
they provide us with information about the patterns in the data. See how one of the ✮❹❸ ❢ ✣
It is the nature of the transformation that the eigenvectors arise from. Imagine a
eigenvectors goes through the middle of the points, like drawing a line of best fit? That
transformation matrix that, when multiplied on the left, reflected vectors in the line
✁ ✁ eigenvector is showing us how these two data sets are related along that line. The But, to get the actual original data back, we need to add on the mean of that original
❂ ❁ . Then you can see that if there were a vector that lay on the line ❂ ❁ , it’s
second eigenvector gives us the other, less important, pattern in the data, that all the data (remember we subtracted it right at the start). So, for completeness,
reflection it itself. This vector (and all multiples of it, because it wouldn’t matter how
points follow the main line, but are off to the side of the main line by some amount.
long the vector was), would be an eigenvector of that transformation matrix. ⑩ ❈❉❶✼❾ ❅ ❭❩t❿❭ ✭✝✉ ❬ ✭ ③ ✭✼✁ ✵ ⑩ ❈❉❶ ⑤ s ✭ ③ ✈✕❅ s②⑥✻s②❇❩③✶❈ ❅✏❽ ⑤ ❭ ✭❿✉ ❬ ✭ ③ ✭P✷ ✐ ❾ ❅ ❭❩t✏❭ ✭✝✉ s ✭ ✣
What properties do these eigenvectors have? You should first know that eigenvec-
So, by this process of taking the eigenvectors of the covariance matrix, we have ✣ ❢ ✣ ✣ ■
been able to extract lines that characterise the data. The rest of the steps involve trans-
tors can only be found for square matrices. And, not every square matrix has eigen-
forming the data so that it is expressed in terms of them lines. This formula also applies to when you do not have all the eigenvectors in the feature
vectors. And, given an ✣ ❢ ✣ matrix that does have eigenvectors, there are ✣ of them.
❞ ❞ vector. So even when you leave out some eigenvectors, the above equation still makes
Given a ❢ matrix, there are 3 eigenvectors.
the correct transform.
Another property of eigenvectors is that even if I scale the vector by some amount Step 5: Choosing components and forming a feature vector
I will not perform the data re-creation using the complete feature vector, because the
before I multiply it, I still get the same multiple of it as a result, as in Figure 2.3. This
Here is where the notion of data compression and reduced dimensionality comes into result is exactly the data we started with. However, I will do it with the reduced feature
is because if you scale a vector by some amount, all you are doing is making it longer,
it. If you look at the eigenvectors and eigenvalues from the previous section, you vector to show you how information has been lost. Figure 3.5 show this plot. Compare

9 14 19

not changing it’s direction. Lastly, all the eigenvectors of a matrix are perpendicular, Original data restored using only a single eigenvector
ie. at right angles to each other, no matter how many dimensions you have. By the way, 4
"./lossyplusmean.dat"
another word for perpendicular, in maths talk, is orthogonal. This is important because
it means that you can express the data in terms of these perpendicular eigenvectors,
instead of expressing them in terms of the ❁ and ❂ axes. We will be doing this later in
the section on PCA. 3

Another important thing to know is that when mathematicians find eigenvectors,


they like to find the eigenvectors whose length is exactly one. This is because, as you
know, the length of a vector doesn’t affect whether it’s an eigenvector or not, whereas 2
the direction does. So, in order to keep eigenvectors standard, whenever we find an
eigenvector we usually scale it to make it have a length of 1, so that all eigenvectors
have the same length. Here’s a demonstration from our example above. Mean adjusted data with eigenvectors overlayed


2
"PCAdataadjust.dat" 1

✞ (-.740682469/.671855252)*x
(-.671855252/-.740682469)*x
1.5

is an eigenvector, and the length of that vector is 0


1
✵ ❞ ✹✓✐ ✞ ✹ ✷✟✁❦❥ ✆❧❞
0.5
so we divide the original vector by this much to make it have a length of 1. -1
-1 0 1 2 3 4
❞ ❞❙♦ ❥ ✫✆ ❞
♠ ❥ ✆✌❞♥✁
0
✞ ✞❙♦ ❥ ✫✆ ❞ Figure 3.5: The reconstruction from the data that was derived using only a single eigen-
-0.5 vector
❞ ❞
How does one go about finding these mystical eigenvectors? Unfortunately, it’s
only easy(ish) if you have a rather small matrix, like no bigger than about ❢ . After -1
it to the original data plot in Figure 3.1 and you will notice how, while the variation
that, the usual way to find the eigenvectors is by some complicated iterative method
along the principle eigenvector (see Figure 3.2 for the eigenvector overlayed on top of
which is beyond the scope of this tutorial (and this author). If you ever need to find the -1.5 the mean-adjusted data) has been kept, the variation along the other component (the
eigenvectors of a matrix in a program, just find a maths library that does it all for you.
other eigenvector that we left out) has gone.
A useful maths package, called newmat, is available at http://webnz.com/robert/ .
-2
Further information about eigenvectors in general, how to find them, and orthogo- -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
nality, can be found in the textbook “Elementary Linear Algebra 5e” by Howard Anton, Exercises
Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6.
Figure 3.2: A plot of the normalised data (mean subtracted) with the eigenvectors of ❀

What do the eigenvectors of the covariance matrix give us?


the covariance matrix overlayed on top.
2.2.2 Eigenvalues

At what point in the PCA process can we decide to compress the data? What
effect does this have?
Eigenvalues are closely related to eigenvectors, in fact, we saw an eigenvalue in Fig-
ure 2.2. Notice how, in both those examples, the amount by which the original vector

For an example of PCA and a graphical representation of the principal eigenvec-


was scaled after multiplication by the square matrix was the same? In that example, tors, research the topic ’Eigenfaces’, which uses PCA to do facial recognition
the value was 4. 4 is the eigenvalue associated with that eigenvector. No matter what
multiple of the eigenvector we took before we multiplied it by the square matrix, we
would always get 4 times the scaled vector as our result (as in Figure 2.3).
So you can see that eigenvectors and eigenvalues always come in pairs. When you
get a fancy programming library to calculate your eigenvectors for you, you usually get
the eigenvalues as well.

10 15 20

Exercises will notice that the eigenvalues are quite different values. In fact, it turns out that

❞ ✪ ✆
the eigenvector with the highest eigenvalue is the principle component of the data set.
For the following square matrix:

✲✲ ✠☛♣✪✆ ✲ ✞ ✞
In our example, the eigenvector with the larges eigenvalue was the one that pointed
down the middle of the data. It is the most significant relationship between the data
dimensions.
In general, once eigenvectors are found from the covariance matrix, the next step
is to order them by eigenvalue, highest to lowest. This gives you the components in
order of significance. Now, if you like, you can decide to ignore the components of
Chapter 4
Decide which, if any, of the following vectors are eigenvectors of that matrix and

✞✞ ✲✪ ✆ ✲✆✆ ✪✆ ❞✞
give the corresponding eigenvalue. lesser significance. You do lose some information, but if the eigenvalues are small, you
don’t lose much. If you leave out some components, the final data set will have less
Application to Computer Vision
✲✆ ✞ ❞ ✪ ✆
dimensions than the original. To be precise, if you originally have $n$ dimensions in
your data, and so you calculate $n$ eigenvectors and eigenvalues, and then you choose
only the first $p$ eigenvectors, then the final data set has only $p$ dimensions.
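
The ordering and selection just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the examples in the text: the sample points are made up, the function names are hypothetical, and the closed-form eigen-decomposition only covers the two-dimensional (2×2 covariance matrix) case.

```python
import math

def covariance_2d(points):
    """Return (a, b, c) for the sample covariance matrix [[a, b], [b, c]]."""
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    a = sum((x - mean_x) ** 2 for x, _ in points) / (m - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (m - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (m - 1)
    return a, b, c

def sorted_eigenpairs(a, b, c):
    """Eigenpairs of [[a, b], [b, c]] as (eigenvalue, unit vector), largest first."""
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    if b == 0:  # already diagonal: the coordinate axes are the eigenvectors
        axes = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
        return [(mid + rad, axes[0]), (mid - rad, axes[1])]
    pairs = []
    for lam in (mid + rad, mid - rad):
        vx, vy = b, lam - a  # solves (A - lam*I) v = 0 when b != 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

# Made-up two-dimensional sample, for illustration only.
points = [(2.0, 1.9), (0.5, 0.8), (3.1, 2.9), (1.0, 1.2), (2.4, 2.6)]
a, b, c = covariance_2d(points)
pairs = sorted_eigenpairs(a, b, c)      # n = 2 components, most significant first

p = 1                                   # number of components to keep
kept = pairs[:p]
retained = sum(lam for lam, _ in kept) / sum(lam for lam, _ in pairs)

# Express each mean-adjusted point in terms of the kept eigenvectors only.
mean_x = sum(x for x, _ in points) / len(points)
mean_y = sum(y for _, y in points) / len(points)
reduced = [[(x - mean_x) * vx + (y - mean_y) * vy for _, (vx, vy) in kept]
           for x, y in points]

print(f"kept {p} of {len(pairs)} components, retaining {retained:.1%} of the variance")
```

Each row of `reduced` has `p` entries instead of the original two, which is the reduction in dimensions the paragraph refers to. For more than two dimensions a general eigen-solver would be needed in place of the closed form.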
The next step is to form a feature vector, which is just a fancy name for a matrix of vectors. This is constructed by taking the eigenvectors that you want to keep from the list of eigenvectors, and forming a matrix with these eigenvectors in the columns.

$$FeatureVector = (eig_1 \;\; eig_2 \;\; eig_3 \;\; \ldots \;\; eig_n)$$

Given our example set of data, and the fact that we have 2 eigenvectors, we have two choices. We can either form a feature vector with both of the eigenvectors:

$$\begin{pmatrix} -.677873399 & -.735178656 \\ -.735178656 & .677873399 \end{pmatrix}$$

or, we can choose to leave out the smaller, less significant component and only have a single column:

$$\begin{pmatrix} -.677873399 \\ -.735178656 \end{pmatrix}$$

We shall see the result of each of these in the next section.

Step 5: Deriving the new data set

This is the final step in PCA, and is also the easiest. Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of the vector and multiply it on the left of the original data set, transposed.

$$FinalData = RowFeatureVector \times RowDataAdjust$$

where $RowFeatureVector$ is the matrix with the eigenvectors in the columns transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top, and $RowDataAdjust$ is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension. I’m sorry if this sudden transpose of all our data confuses you, but the equations from here on are

This chapter will outline the way that PCA is used in computer vision, first showing how images are usually represented, and then showing what PCA can allow us to do with those images. The information in this section regarding facial recognition comes from “Face Recognition: Eigenface, Elastic Matching, and Neural Nets”, Jun Zhang et al. Proceedings of the IEEE, Vol. 85, No. 9, September 1997. The representation information is taken from “Digital Image Processing”, Rafael C. Gonzalez and Paul Wintz, Addison-Wesley Publishing Company, 1987. It is also an excellent reference for further information on the K-L transform in general. The image compression information is taken from http://www.vision.auc.dk/ sig/Teaching/Flerdim/Current/hotelling/hotelling.html, which also provides examples of image reconstruction using a varying amount of eigenvectors.

4.1 Representation

When using this sort of matrix technique in computer vision, we must consider the representation of images. A square, $N$ by $N$ image can be expressed as an $N^2$-dimensional vector

$$X = (x_1 \;\; x_2 \;\; x_3 \;\; \ldots \;\; x_{N^2})$$

where the rows of pixels in the image are placed one after the other to form a one-dimensional image. E.g. the first $N$ elements ($x_1$ to $x_N$) will be the first row of the image, the next $N$ elements are the next row, and so on. The values in the vector are the intensity values of the image, possibly a single greyscale value.

4.2 PCA to find patterns

Say we have 20 images. Each image is $N$ pixels high by $N$ pixels wide. For each image we can create an image vector as described in the representation section. We can then put all the images together in one big image-matrix like this:

UNIT 9 SIMPLE LINEAR REGRESSION Regression Modelling c) The sum of the observed values (Yi) is equal to the sum of the fitted
 
values Ŷi :
Structure n n

➆ ❫ ✭ t s①⑥✻s①❇ ✆ 9.1 Introduction  Y   Ŷ


i 1
i
i 1
i

➆ ❫ ✭ t s①⑥✻s①❇ ✞ Objectives
➆ ❫ ✭ t s①✴ ✭ ③❅ ❭ ✁
■ ❁ q 9.2 Simple Linear Regression d) The least square regression lines always pass through the centroid
q
➆ ❫ ✭ t ②s ⑥⑦s②❇ ✞✏✪ 9.3 Least Squares Estimation of Parameters  X, Y  of the data since we know that the two lines of regression
9.4 Fitting of Regression Line intersect at this point.
which gives us a starting point for our PCA analysis. Once we have performed PCA, 9.5 Residual Analysis e) The sum of the residuals weighted by the corresponding value of the
we have our original data in terms of the eigenvectors we found from the covariance n
Scaling of Residuals
matrix. Why is this useful? Say we want to do facial recognition, and so our original
images were of people’s faces. Then the problem is, given a new image, whose face Residual Plots
regressor variables is always zero, i.e., X r
i 1
i i 0
from the original set is it? (Note that the new image is not one of the 20 we started Normal Probability Plot
with.) The way this is done in computer vision is to measure the difference between f) The sum of the residuals weighted by the corresponding fitted values is
9.6 Summary n
the new image and the original images, but not along the original axes, along the new
axes derived from the PCA analysis.
9.7 Solutions/Answers always equal to zero, i.e.,  Ŷ r
i 1
i i 0
It turns out that these axes work much better for recognising faces, because the
PCA analysis has given us the original images in terms of the differences and simi- 9.1 INTRODUCTION Now, you should answer the following questions to assess your understanding.
larities between them. The PCA analysis has identified the statistical patterns in the In Blocks 1 and 2, you have learnt some basic methods of optimisation of
data.
E3) Describe the method of obtaining the least squares estimates of the
various problems such as LPP, transportation problem, assignment problem, parameters a and b for the regression model Y = a + bX + e.
Since all the vectors are $N^2$-dimensional, we will get $N^2$ eigenvectors. In practice,
queueing problem, scheduling and sequencing problem and inventory
we are able to leave out some of the less significant eigenvectors, and the recognition E4) Describe the properties (in brief) of the estimates of the parameters a and
still performs well. problem. In this unit, we discuss the concepts of regression modelling.
Regression analysis is a statistical tool for investigating and analysing the b for the regression model Y = a + bX + e.
average relationship between two or more variables. In regression analysis,
4.3 PCA for image compression one variable is referred to as the dependent variable or response variable, 9.4 FITTING OF REGRESSION LINE
whereas the other variable is referred to as independent variable, predictor
Using PCA for image compression is also known as the Hotelling, or Karhunen-Loève (KL), transform. variable or regressor variable. The regression analysis with only one predictor
If we have 20 images, each with $N^2$ pixels, we can form $N^2$ vectors, In Sec. 9.3, you have learnt the method of least squares for estimating the
variable is known as simple regression analysis and with two or more parameters a and b in the simple linear regression model Y= a + b X +e. You
each with 20 dimensions. Each vector consists of all the intensity values from the same
pixel from each picture. This is different from the previous example because before we
predictor variables is known as multiple regression analysis. have also learnt the properties of the least squares estimates. In this section, we
had a vector for image, and each item in that vector was a different pixel, whereas now The term simple linear regression refers to a regression equation with only discuss the method of fitting the simple linear regression equation for the
we have a vector for each pixel, and each item in the vector is from a different image. one predictor variable and the equation is linear. This means that if Y is the given data of n pairs of observations on (X, Y).
Now we perform the PCA on this set of data. We will get 20 eigenvectors because dependent variable and X, the independent variable, the regression equation
each vector is 20-dimensional. To compress the data, we can then choose to transform is of the form Y = a + b X. Such an analysis is useful in many situations, Let the given data of n pairs of observations on X and Y be as follows:
the data only using, say, 15 of the eigenvectors. This gives us a final data set with
e.g., in analysing the relationship between the number of customers and
only 15 dimensions, which has saved us $1/4$ of the space. However, when the original
monthly sales of a product. In regression analysis, we use the method of X: X1 X2 X3 . … … Xi ….. … Xn
data is reproduced, the images have lost some of the information. This compression
technique is said to be lossy because the decompressed image is not exactly the same least squares to fit the regression equation to the data. We discuss the Y: Y1 Y2 Y3 . … … Yi ….. … Yn
as the original, generally worse. concept of simple linear regression, the method of least squares and fitting
of regression line in Secs. 9.2 to 9.4. In Sec. 9.5, we present residual where Y is the dependent variable and X, the independent variable. Suppose,
analysis. we wish to fit the following simple regression equation to the
In the next unit, we shall discuss the test of significance of regression data:
coefficients, the coefficient of determination and confidence intervals of
Y = a + bX …(18)
regression coefficients.
Objectives where a is the intercept and b is the slope of the equation. For fitting equation
(18) to the data on (X, Y), we follow the steps given below:
After studying this unit, you should be able to:
• describe the simple linear regression model;
• estimate the model parameters using the method of least squares;
• describe the properties of the estimates;
• fit the simple linear regression line; and
• check the model assumptions with the help of residual analysis.

Step 1: We draw a scatter diagram by plotting the (X, Y) points given in data.
Step 2: We construct a table as given below and take the sum of the values of Xi, Yi, XiYi and Xi². We write the values of $\sum X$, $\sum Y$, $\sum XY$ and $\sum X^2$ in the last row:



9.2 SIMPLE LINEAR REGRESSION X Y XY X2
X1 Y1 X1 Y1 X12
Regression analysis may broadly be defined as the analysis of relationships X2 Y2 X2 Y2 X 22
among variables. It is popular because it explores the power of statistics as a
X3 Y3 X3 Y3 X 32
tool for establishing average relationships among variables. This relationship
is given as an equation that helps to predict the dependent variable Y through . . .
.
one or more independent variables. In regression analysis, the variable whose . . .
values vary with the variations in the values of the other variable(s) is called .
Appendix A the dependent variable or response variable. The other variables which are Xi Yi Xi Y i X i2
independent in nature and influence the response variable are called .
. . .
independent variables, predictor variables or regressor variables. A .
regression equation with a single independent variable is called a simple . . .

Implementation Code regression equation and it is linear when the model parameters have a linear Xn Yn Xn Yn X 2n
relation.
2
X Y  XY X
2

The equation Y = a + b X may also be classified as simple linear regression


because it has only one independent variable X and by change of variable, e.g., Step 3: We express of â given in equation (10) as follows:
This is code for use in Scilab, a freeware alternative to Matlab. I used this code to
1
generate all the examples in the text. Apart from the first macro, all the rest were X2=T, we can get a straight line of the form Y = a + b T even though Y bears a â  Y  bX   Y  b X 
n
… (19)
written by me. quadratic relationship with X. Its conceptual usefulness exists in analysing the
relationship between a variable of interest and a set of related predictor
 X,  Y,  XY and X
2
Step 4: We substitute the values of in
variables using an equation.
// This macro taken from equation (11) and calculate the optimum value of b̂ as
// http://www.cs.montana.edu/˜harkin/courses/cs530/scilab/macros/cov.sci Let us take an example where regression technique may be useful. Suppose a
// No alterations made
statistician employed by a cold drink bottler is analysing the product delivery n  XY   X Y
b̂  … (20)
n  X 2   X 
2
// Return the covariance matrix of the data in x, where each column of x
and service operation for vending machines. He would like to find how the
delivery time taken by the delivery man to load and service a machine is
// is one dimension of an n-dimensional data set. That is, x has n columns
// and m rows, and each row is one sample. related to the volume of delivery cases. The statistician visits 50 randomly Step 5: Now for obtaining the value of â , we substitute the value of  X,  Y
// chosen retailer shops having vending machines and observes the delivery time and slope b̂ in equation (19) and obtain the optimum value of intercept â
// For example, if x is three dimensional and there are 4 samples. (in minutes) and the volume of delivery cases for each shop. He plots those 50 as
 
// x = [1 2 3;4 5 6;7 8 9;10 11 12] observations on a graph, which shows that an approximate linear relationship
1
// c = cov (x) exists between the delivery time and delivery volume. If Y represents the â   Y  b̂  X
delivery time and X, the delivery volume, the equation of a straight line n
function [c]=cov (x) relating these two variables may be given as Step 6: We put these values of â and b̂ in the regression equation and get
// Get the size of the array
sizex=size(x); Y = a + bX … (1) Ŷ  â  b̂X
// Get the mean of each column Let us obtain the regression line for a given set of data.
meanx = mean (x, "r"); where a is the intercept and b, the slope.
// For each pair of variables, x1, x2, calculate Example 1: Sales data of 10 months for a coffee house situated near a prime
// sum ((x1 - meanx1)(x2-meanx2))/(m-1) location of a city comprising the number of customers (in hundreds) and monthly
In such cases, we draw a straight line in the form of equation (1) so that the
for var = 1:sizex(2), sales (in Thousand Rupees) are given below:
data points generally fall near the straight line. Now, suppose the points do not
x1 = x(:,var); S. No. No. of Customers Monthly Sales
fall exactly on the straight line. Then we should modify equation (1) to
mx1 = meanx (var); (in hundreds) (in thousand Rs.)
for ct = var:sizex (2),
minimise the difference between the observed value of Y and that given by the
straight line (a + b X). This is known as error. 1 6.0 01
x2 = x(:,ct);
2 6.1 06
mx2 = meanx (ct);
v = ((x1 - mx1)’ * (x2 - mx2))/(sizex(1) - 1); The error e, which is the difference between the observed value and the 3 6.2 08
predicted value of the variable of interest Y, may be conveniently assumed as 4 6.3 10
a statistical error. This error term accounts for the variability in Y that cannot 5 6.5 11
be explained by the linear relationship between X and Y. It may arise due to 6 7.1 20
the effects of other factors. Thus, a more plausible model for the variable of 7 7.6 21
interest (Y) may be given as
8 7.8 22
Y=a+bX+e … (2) 9 8.0 23
10 8.1 25
where the intercept a and the slope b are unknown constants and e is a random
error component. Find the simple linear regression equation that fits the given data.

Equation (2) is called a linear regression model. Since this model involves Solution: It is given that n = 10. Let us assume an equation of simple linear
only one regressor or independent variable, it is called a simple linear regression as follows:
regression model. In case of two or more than two independent variables, it is
Y = a + bX
cv(var,ct) = v; known as a multiple linear regression model.
cv(ct,var) = v; where X is the independent variable and Y, the dependent variable. Now for
// do the lower part of c also. The following assumptions are made about the error term e in the regression
finding the fitted regression equation, we follow the procedure described in
end, model given in equation (2):
Steps 1 to 6.
end, i) the error term e is a random variable with mean zero; i.e., E (e) = 0;
c=cv; We first draw a scatter diagram by plotting the (X, Y) points as shown in
ii) the variance of the error term e, denoted by σ2, is the same for all values of X; Fig. 9.1.
iii) all the error terms, e1, e2, …, en, are independent. This means that the
// This is a simple wrapper function to get just the eigenvectors
value of e for a particular value of X is not related to the value of e for
// since the system call returns 3 matrices
any other value of X. Thus, the value of Y for a particular X is
Monthly Sales
function [x]=justeigs (x)
// This just returns the eigenvectors of the matrix independent of any other value of X; and 30

iv) the error term e is a normally distributed random variable. 25


[a, eig, b] = bdiag(x); 20
These assumptions provide the theoretical basis for the t-test and F-test used to
determine whether the relationship between X and Y is significant and for the 15
x= eig; Monthly Sales
estimation of confidence and prediction interval. 10

You may now like to answer the following questions, which will help you 5

// this function makes the transformation to the eigenspace for PCA


assess your understanding of simple linear regression. 0
// parameters: 6 6.5 7 7.5 8 8.5
E1) What is simple linear regression ? Write the equation for simple linear
// adjusteddata = mean-adjusted data set regression model.
// eigenvectors = SORTED eigenvectors (by eigenvalue)
// dimensions = how many eigenvectors you wish to keep E2) Answer the following questions for the error term e in the regression Fig. 9.1: Scatter diagram for the (X, Y) points given in the data of monthly sales.
// model Y = a + b X + e: We construct a table for finding the values of X, Y, XY and X2
// The first two parameters can come from the result of calling as follows:
// PCAprepare on your data. i) What are the mean and variance of the random error e?
// The last is up to you. ii) Which distribution does it follow? X Y XY X2
iii) Are the errors e1, e2, … independent? 6.0 01 6.0 36.00
function [finaldata] = PCAtransform(adjusteddata,eigenvectors,dimensions)
finaleigs = eigenvectors(:,1:dimensions); 6.1 06 36.6 37.21
prefinaldata = finaleigs’*adjusteddata’; 6.2 08 49.6 38.44
finaldata = prefinaldata’; 9.3 LEAST SQUARES ESTIMATION OF 6.3 10 63.0 39.69
PARAMETERS 6.5 11 71.5 42.25

// This function does the preparation for PCA analysis In the previous section, we have discussed the simple regression equation with 7.1 20 142.0 50.41
// It adjusts the data to subtract the mean, finds the covariance matrix, only one regressor variable X and the variable of interest Y. We have also 7.6 21 159.6 57.76
// and finds normal eigenvectors of that covariance matrix. discussed the simple linear regression model with a single regressor variable
// It returns 4 matrices
7.8 22 171.6 60.84
X. The simple linear regression model has two unknown parameters a and b,
// meanadjust = the mean-adjust data set which are known as intercept and regression coefficient, respectively. Their 8.0 23 184.0 64.00
// covmat = the covariance matrix of the data
values are unknown. Therefore, they must be estimated using sample data. The 8.1 25 202.5 65.61
// eigvalues = the eigenvalues of the covariance matrix, IN SORTED ORDER
estimation of the parameters a and b is done by minimising the error term e.
// normaleigs = the normalised eigenvectors of the covariance matrix,
// IN SORTED ORDER WITH RESPECT TO Let (Y1, X1), (Y2, X2), …, (Yn, Xn) be n pairs of values in the data. The å X = 69.7 å Y = 147 å XY = 1086.4 å
2
X = 492.21
// THEIR EIGENVALUES, for selection for the feature vector. equation of the simple linear regression model may be written as
From the table,
Y = a + bX + e … (3)
$\sum X = 69.7$, $\sum Y = 147$, $\sum XY = 1086.4$ and $\sum X^2 = 492.21$
where e represents the error term which arises due to the difference of the
We obtain the intercept a and slope b from equations (19) and (20) by
observed Y and the fitted line Ŷ  â  b̂ X . We use the method of least squares
substituting the values of $\sum X$, $\sum Y$, $\sum XY$ and $\sum X^2$ in them. Thus,
to minimise the error term e. From equation (3), we may write a simple
regression model as
Yi = a + b Xi + ei,   i = 1, 2, …, n … (4)
for a sample data of n pairs of values given in terms of (Yi, Xi), (i = 1, 2, …, n).

$$\hat{b} = \frac{10 \times 1086.4 - (147)(69.7)}{10 \times 492.21 - (69.7)(69.7)} = \frac{10864.0 - 10245.9}{4922.10 - 4858.09} = \frac{618.10}{64.01} = 9.656$$
and

$$\hat{a} = \frac{1}{10}\left(147 - 9.656 \times 69.7\right) = \frac{1}{10}\left(147 - 673.02\right) = -52.6$$

We estimate a and b so that the sum of the squares of the differences between the observed values (Yi) and the points lying on the straight line is minimum, i.e., the sum of squares of the error terms given by

$$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - a - bX_i\right)^2 \quad \ldots (5)$$

//
// NOTE: This function cannot handle data sets that have any eigenvalues
// equal to zero. It’s got something to do with the way that scilab treats
// the empty matrix and zeros.
//
function [meanadjusted,covmat,sorteigvalues,sortnormaleigs] = PCAprepare (data) i 1 i 1 The best fitted regression equation for the given data using the method of least
// Calculates the mean adjusted matrix, only for 2 dimensional data is minimum. To find the values of a and b for which the sum of squares of the squares is given as
means = mean(data,"r");
error terms, i.e., E is minimum, we differentiate it with respect to the
meanadjusted = meanadjust(data); Ŷ  52.6  9.656 X
parameters a and b and equate the results to zero:
covmat = cov(meanadjusted);
eigvalues = spec(covmat); E n We may also show the fitted regression line for the given data as in Fig. 9.2.
normaleigs = justeigs(covmat);  2 Yi  a  bX i   0 … (6)
a i 1
sorteigvalues = sorteigvectors(eigvalues’,eigvalues’);
sortnormaleigs = sorteigvectors(eigvalues’,normaleigs); and 30
Monthly Sales
E n
 2 Yi  a  bX i  X i  0
25
… (7)
b i 1 20
// This removes a specified column from a matrix
// A = the matrix Simplifying equations (6) and (7), we get 15 Monthly
// n = the column number you wish to remove n n y = 9.656x - 52.60 Sales
function [columnremoved] = removecolumn(A,n) n a  b  X i   Yi … (8) 10
inputsize = size(A); i 1 i 1
5
numcols = inputsize(2); n n n
temp = A(:,1:(n-1)); a  X i  b  X i2   Yi X i … (9) 0
for var = 1:(numcols - n) i 1 i 1 i 1 6 6.5 7 7.5 8
temp(:,(n+var)-1) = A(:,(n+var));
end, Equations (8) and (9) are called the least-squares normal equations. The Fig. 9.2: Fitted regression equation for Example 1.
columnremoved = temp; solution to these normal equations is
You may now like to solve the following problem to check your
â  Y  b̂ X … (10) understanding:
// This finds the column number that has the
where Y and X are the averages of Yi and Xi , respectively. On putting the E5) A statistician collected the sales data of a restaurant near a metro station
// highest value in its first row.
value of â from equation (10) in equation (9), we get for 12 months. The data comprising the number of customers (in
function [column] = highestvalcolumn(A)
inputsize = size(A); hundreds) and monthly sales (in thousand Rupees) is given as follows:
 Y X   Y   X  / n
n
numcols = inputsize(2); i i i i Month No. of Customers Monthly Sales
maxval = A(1,1); bˆ  i 1
… (11) (in hundreds) (in thousand Rs.)
 X 
n

X
maxcol = 1; 2

for var = 2:numcols i


2
 i /n 1 04 1.8
i 1
if A(1,var) > maxval 2 06 3.5
maxval = A(1,var); Since the denominator of equation (11) is the corrected sum of squares of Xi, 3 06 5.8
maxcol = var; we may rewrite it as
end, 4 08 7.8
 X 
2
end, n 5 10 8.7
   Xi  X 
n
SSX   X i2 
i 2
column = maxcol 6 14 9.8
i 1 n i 1
7 18 10.7
Similarly, the numerator is the corrected sum of the cross product of Xi and Yi
and may be rewritten as: 8 20 11.5

 Y   X    X
9 22 12.9
n
SSXY =  Yi X i    X Yi  Y 
i i
i
10 26 13.6
n i 1
11 28 14.2
12 30 15.0
Find the fitted simple linear regression equation for the given data.

Therefore, the expression for $\hat{b}$ may be rewritten as

$$\hat{b} = \frac{SS_{XY}}{SS_X} \quad \ldots (12)$$
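
The least squares estimates can also be computed directly from the corrected sums of squares. The following is a minimal sketch, assuming the data of Example 1 (number of customers and monthly sales); the function and variable names are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Least squares estimates (a_hat, b_hat) for Y = a + bX + e."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    ss_xy = sum_xy - sum_x * sum_y / n   # corrected sum of cross products
    ss_x = sum_x2 - sum_x ** 2 / n       # corrected sum of squares of X
    b_hat = ss_xy / ss_x                 # slope
    a_hat = (sum_y - b_hat * sum_x) / n  # intercept: mean of Y minus b_hat * mean of X
    return a_hat, b_hat

# Data of Example 1: customers (in hundreds) and monthly sales (in thousand Rs.).
customers = [6.0, 6.1, 6.2, 6.3, 6.5, 7.1, 7.6, 7.8, 8.0, 8.1]
sales = [1, 6, 8, 10, 11, 20, 21, 22, 23, 25]

a_hat, b_hat = fit_simple_linear_regression(customers, sales)
residuals = [y - (a_hat + b_hat * x) for x, y in zip(customers, sales)]

print(f"fitted line: Y = {a_hat:.2f} + {b_hat:.3f} X")
print(f"sum of residuals: {sum(residuals):.2e}")
```

The residuals are included because a fit with an intercept should make them sum to zero (up to rounding), which gives a quick sanity check on the arithmetic.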



Thus, â and b̂ are the least squares estimates of the intercept a and slope b, 9.5 RESIDUAL ANALYSIS
respectively. Therefore, the fitted simple linear regression model is given by
In the previous section, we have discussed how to obtain the estimated values
// This sorts a matrix of vectors, based on the values of Ŷ  â  b̂ X … (13) of the parameters a and b for a simple linear regression model given by
// another matrix equation (1):
// Equation (13) gives a point estimate of the mean of Y for a particular X. The
// values = the list of eigenvalues (1 per column) difference between the fitted value Ŷi and Yi is known as the residual and is Y = a + bX
// vectors = The list of eigenvectors (1 per column) denoted by ri: We have also explained how to get the following fitted model after minimising
//
ri = Yi  Ŷi , i = 1, 2, …, n …(14) the error term e:
// NOTE: The values should correspond to the vectors
// so that the value in column x corresponds to the vector The role of the residuals and its analysis is very important in regression Ŷ  â  b̂ X
// in column x. modelling. We shall discuss it in detail in Sec. 9.5.
function [sortedvecs] = sorteigvectors(values,vectors) We are now interested in knowing: How well does this equation fit the data? Is
inputsize = size(values); Properties of Estimates of Parameters the model likely to be useful as a predictor? Can the fitted model be used to
numcols = inputsize(2); predict the value of the dependent variable?
highcol = highestvalcolumn(values); So far you have learnt that the estimates of parameters of a simple linear
regression model are obtained with the help of the method of least squares. On All these issues must be investigated before adopting the model for use. A
sorted = vectors(:,highcol);
putting the values of the estimates of the parameters in the equation, we get a simple but efficient method of detecting model deficiencies is to examine the
remainvec = removecolumn(vectors,highcol);
remainval = removecolumn(values,highcol);
residuals. Residuals are the differences between the observed values and the
fitted simple linear regression model. The least squares estimates â and b̂ have corresponding fitted values. Mathematically, if Yi is the ith observation of the
for var = 2:numcols several important properties, which are given below:
highcol = highestvalcolumn(remainval); dependent variable Y and Ŷi , the corresponding fitted value, then
sorted(:,var) = remainvec(:,highcol); a) The means of â and b̂ (the estimates of the parameters a and b) of a r = Y - Yˆ , i = 1, 2,…, n … (21)
i i i
remainvec = removecolumn(remainvec,highcol); simple linear regression model
remainval = removecolumn(remainval,highcol); From equation (21), note that the residual ri is the deviation between the value
end, E(Yi) = a + b Xi
given in the data and the corresponding fitted value. Upon substituting
sortedvecs = sorted; are given by ˆ  aˆ  bˆ X in equation (21), we get

Y
Eâ   a and E b̂  b ,
i i

// This takes a set of data, and subtracts


respectively. Thus, if we assume that the model is correct, then â and b̂

ri  Yi  â  b̂ X i ,  i = 1, 2,…, n … (22)
// the column mean from each column. where â and b̂ are the estimated values of the parameters a and b. As we have
function [meanadjusted] = meanadjust(Data)
are unbiased estimators of a and b.
discussed earlier, residuals play a leading role in detecting the adequacy of a
inputsize = size(Data); The variance of b̂ is given as model. In other words, residuals may be defined as realised or observed values
numcols = inputsize(2);
of the model errors.
means = mean(Data,"r");
tmpmeanadjusted = Data(:,1) - means(:,1);

Var bˆ 
2
SSx
…(15)
9.5.1 Scaling of Residuals
for var = 2:numcols
tmpmeanadjusted(:,var) = Data(:,var) - means(:,var); The variance of â is given as Analysis and plotting of residuals is an effective way of discovering the model
end,
meanadjusted = tmpmeanadjusted

Var â   Var Y  b̂ X  inadequacies, investigating how well the model fits the data and checking the
assumptions listed below:
ˆ + X 2 Var bˆ - 2X Cov(Y, b)
= Var Y
() ()
ˆ
i) the error term ei has zero mean;
ii) the error term ei has constant variance σ2;
ˆ is just Var (Y )   and the covariance between
2
Now the variance of Y iii) the errors are uncorrelated;
n
Yˆ and b̂ would be zero. Thus, we get iv) the errors are normally distributed.
1 X  2 In this section, we discuss the standardised method for scaling residuals.
Var  aˆ    2    …(16)
 n SSX  Some observations in the data are separated or very far away from the rest of
the data. These are called outliers or extreme values. The scaled residuals are
b) The sum of the residuals in any regression model that contains an helpful in finding the outliers or the extreme values.
intercept a is always zero, i.e.,
Residuals have zero mean and their approximate average variance is estimated
n n
ˆ )
å (
ˆ
ri = å Yi –Y i ) = å (Y - aˆ - bX
i i
by
n n

 r  r  r
i =1 i =1 2 2

= å [Yi - Y - bˆ X i - X ]  â  Y  b̂ X
i i
( ) ˆ 2  i 1
 i 1
= SSRes ( r is zero) … (23)
n k nk
= å (Yi - Y) - bˆ å X i - X = 0( ) …(17)
where n  k is the degree of freedom. Alternatively r i
2
can be written as
2 2 Simple Linear Regression Regression Modelling Ordered Ranks Cumulative Percentile Cumulative Then we follow the method of constructing a normal probability plot as Simple Linear Regression
éY - Y - bˆ X - X ù
å ri2 = å ( Y - Yˆ ) = å
i i ëê I i (
úû ) Residuals (i) Probability Probability explained in Sec 9.5.3. We arrange the standardised residuals in increasing
2 2 p i  i  1 2  n Pi   i  1 2  n   1 00 order and rank the residuals, which are arranged in increasing order. After
= å Yi - Y ( ) - 2bˆ å Yi - Y ( )( X - X) + bˆ å ( X - X)
i
2
ranking the residuals, we calculate the cumulative probability using equation (26) and the percentile cumulative probability using equation (27) for the corresponding ranked residual values.

Ordered Residual   Rank (i)   Cumulative Probability p_i   Percentile Cumulative Probability P_i
−1.88807            1         0.041667                      4.1667
−1.27982            2         0.125                        12.5
−0.53461            3         0.20833                      20.833
−0.48012            4         0.291667                     29.1667
−0.27835            5         0.375                        37.5
 0.106038           6         0.45833                      45.833
 0.16053            7         0.541667                     54.1667
 0.413844           8         0.625                        62.5
 0.493373           9         0.70833                      70.833
 0.784978          10         0.791667                     79.1667
 1.243004          11         0.875                        87.5
 1.26215           12         0.95833                      95.833

Then we plot the percentile cumulative probability versus the standardised residuals to obtain the normal probability plot. The resulting points would lie approximately on a straight line as shown in Fig. 9.16.

Fig. 9.16: Normal probability plot for the given data.

From the normal probability plot (Fig. 9.16), you may note that some points of the distribution deviate slightly from the straight line, but do not lie very far from the central points.

Thus,
$$SS_{Res} = \Big[\sum (Y_i - \bar{Y})^2 - 2\hat{b} \sum (Y_i - \bar{Y})(X_i - \bar{X}) + \hat{b}^2 \sum (X_i - \bar{X})^2\Big] / (n - k)$$
$$= \big[SS_Y - 2\,SS_{XY}^2 / SS_X + SS_{XY}^2 / SS_X\big] / (n - k)$$
$$= \big[SS_Y - SS_{XY}^2 / SS_X\big] / (n - k)$$
Residuals are not independent because estimates of k parameters impose k constraints on them.

A logical method of scaling the residuals is to divide the ith error term by its standard deviation, which gives the ith standardised residual d_i, i.e.,
$$d_i = \frac{r_i}{\sqrt{SS_{Res}}}, \quad \forall\, i = 1, 2, 3, \ldots, n \qquad \ldots (24)$$
(∀ is the symbol used to denote the phrase “for every”.)
This method is known as the standardised method of scaling the residuals and d_i is called the ith standardised residual. The standardised residuals have zero mean and their variance is approximately 1, i.e.,
$$E(d_i) = \frac{E(r_i)}{\sqrt{SS_{Res}}} = 0 \quad \text{and} \quad Var(d_i) = \Big(\frac{1}{\sqrt{SS_{Res}}}\Big)^2 Var(r_i) \approx 1 \qquad \ldots (25)$$
The values of the standardised residuals indicate the positions of the observed data. A large value of a standardised residual indicates its distance from the fitted value of the model. Consequently, a very large value of a standardised residual, i.e., $|d_i| > 3$, indicates an extreme observation or an outlying observation.

9.5.2 Residual Plots
A very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions is the graphical analysis of residuals. The basic residual plots are generated by plotting the residual values against the predicted Y values. These should be examined in all regression modelling problems.
A satisfactory residual plot should be more or less a horizontal band of points, as shown in Fig. 9.3.
Fig. 9.3
Heteroscedastic data (non-constant variance) will have a residual plot as shown in Fig. 9.4.

Ordered Residual   Rank (i)   Cumulative Probability      Percentile Cumulative Probability
                              p_i = (i − 1/2)/n           P_i = ((i − 1/2)/n) × 100
−1.86495            1         0.05                         5
−0.70882            2         0.15                        15
−0.3083             3         0.25                        25
−0.26391            4         0.35                        35
−0.12972            5         0.45                        45
 0.092215           6         0.55                        55
 0.315183           7         0.65                        65
 0.35957            8         0.75                        75
 0.760086           9         0.85                        85
 1.738667          10         0.95                        95

Then we plot the percentile cumulative probability against the standardised residuals to obtain the normal probability plot. The resulting points would lie approximately on a straight line as shown in Fig. 9.11.

Fig. 9.11: Normal probability plot for the given data.

From Fig. 9.11, note that some points of the distribution deviate slightly from the straight line, but do not lie very far from the central points.
On the basis of the above residual analysis for the given data, the best fitted regression equation is obtained as shown in Fig. 9.12.

Fig. 9.12: Best fitted regression equation after residual analysis for the given data. (Chart: fitted Y values with linear trend line.)

Regression Modelling
Simple Linear Regression

You may now like to solve the following questions to assess your understanding.
E6) Describe the method of scaling the residuals for the simple linear regression model Y = a + b X + e.
E7) Calculate the residuals and determine the standardised residuals for the simple linear regression model obtained for the data given in E2 of this unit. Generate the corresponding residual plot and the normal probability plot.
We now summarise the concepts that we have discussed in this unit.

9.6 SUMMARY
1. Simple linear regression fits a straight line through the set of n points in such a way that the sum of squared residuals of the model (that is, the squared vertical distances between the points of the data set and the fitted line) is as small as possible.
2. The simple linear regression model has two unknown parameters a and b, which are known as the intercept and the regression coefficient, respectively. Since their values are unknown, they must be estimated using the sample data. The parameters a and b are estimated by minimising the sum of squares of the error term e.
3. An essential part of regression analysis is a careful examination of the residuals to check that the assumptions in the least squares theory are met. Some observations in the data are separated or very far away from the rest of the data. These are called outliers or extreme values. The scaled residuals are helpful in finding the outliers or the extreme values.
4. A very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions is the graphical analysis of the residuals. The basic residual plots are generated by plotting the residual values against the predicted Y values. These should be examined in all regression modelling problems.
5. The values of the standardised residuals indicate the positions of the observed data. Large values of standardised residuals indicate their distance from the fitted values of the model. Consequently, a very large value of a standardised residual, i.e., $|d_i| > 3$, indicates an extreme observation or an outlying observation.
6. A small departure from the normality assumptions does not affect the model extensively. Since the confidence intervals depend on the normality assumptions, gross non-normality is potentially more serious. Construction of the normal probability plot of the residuals is a very simple method to check the normality assumptions.
7. To construct the normal probability plot, we arrange the standardised residuals in increasing order and rank them. After ranking the residuals, we calculate the percentile cumulative probability for the corresponding ranked residual values.

On the basis of the above residual analysis for the given data, we obtain the best fitted regression equation shown in Fig. 9.17.

Fig. 9.17: Best fitted regression equation after residual analysis for the given data. (Chart: Monthly Sales with linear trend line.)

Fig. 9.4
The trend shown in Fig. 9.5 may be exhibited by data for which there is an error in the regression calculation or for which some additional regression term is needed in the model.
Fig. 9.5
If the relationship between X and Y is nonlinear, the pattern of residuals shown in Fig. 9.6 will be observed, i.e., a curvilinear relationship is suggested.
Fig. 9.6

9.5.3 Normal Probability Plot
A small departure from the normality assumptions does not affect the model extensively. Since the confidence intervals depend on the normality assumptions, gross non-normality is potentially more serious. Construction of the normal probability plot of the residuals is a very simple method to check the normality assumptions. The method of constructing a normal probability plot is as follows:
1. We arrange the standardised residuals in increasing order.
2. We rank the residuals, which are arranged in increasing order. Let r1, r2, r3, …, rn be the calculated standardised residuals ranked in increasing order.
3. After ranking the residuals, we calculate the cumulative probability (pi) as
$$p_i = \frac{i - \frac{1}{2}}{n}; \quad i = 1, 2, \ldots, n \qquad \ldots (26)$$
for the corresponding ranked residual values.
4. Then we compute the percentile cumulative probability (Pi) by multiplying pi by 100, i.e.,
$$P_i = \frac{i - \frac{1}{2}}{n} \times 100; \quad i = 1, 2, \ldots, n \qquad \ldots (27)$$
We plot Pi versus the standardised residuals to obtain the normal probability plot. The resulting points would lie approximately on a straight line.
In the normal probability plot, we usually determine the straight line visually with emphasis on the central values of all the points rather than the extreme points at both ends. You should remember the following points about the normal probability plot for the fitted model:
1. A normal probability plot is called ideal if all points lie approximately along a straight line as shown in Fig. 9.7.
Fig. 9.7
2. A sharp upward and downward curve at both ends indicates that the tails of this distribution are too light for it to be considered normal. This pattern of a light-tailed distribution is shown in Fig. 9.8a. A heavy-tailed distribution shows flattening at the extremes, which is a pattern of samples from a distribution with heavier tails than normal, as shown in Fig. 9.8b.
(a) (b)
Fig. 9.8

Machine Learning (CSPC41)
Linear Regression
Thanks and Acknowledgements
Dr Anoop Kumar Patel
Department of Computer Engineering, NIT Kurukshetra

9.7 SOLUTIONS/ANSWERS
E1) Refer to Sec. 9.2.
E2) Refer to Sec. 9.2.
E3) Refer to Sec. 9.3.
E4) Refer to Sec. 9.3.
E5) Refer to Sec. 9.4.
E6) We are given n = 12. Let us assume an equation of a simple linear regression model as
Y = a + bX
For finding the fitted regression equation, we first draw a scatter diagram by plotting the (X, Y) points as shown in Fig. 9.13.

Fig. 9.13: Scatter diagram for the (X, Y) points given in the data. (Chart: Monthly Sales.)

We construct the following table for calculating the values of â and b̂:

S. No.    X       Y        XY       X²
1         4       1.8      7.2      16
2         6       3.5      21.0     36
3         6       5.8      34.8     36
4         8       7.8      62.4     64
5         10      8.7      87.0     100
6         14      9.8      137.2    196
7         18      10.7     192.6    324
8         20      11.5     230.0    400
9         22      12.9     283.8    484
10        26      13.6     353.6    676
11        28      14.2     397.6    784
12        30      15.0     450.0    900
Total     ΣX = 192   ΣY = 115.3   ΣXY = 2257.2   ΣX² = 4016

Substituting the values of ΣX, ΣY, ΣXY and ΣX² in equations (19) and (20), we get
$$\hat{b} = \frac{12 \times 2257.2 - 192 \times 115.3}{12 \times 4016 - 192 \times 192} = \frac{27086.4 - 22137.6}{48192 - 36864} = 0.4369$$

$$\text{and} \quad \hat{a} = \frac{1}{12}\big[115.3 - (0.437 \times 192)\big] = 2.6185$$
Therefore, the best fitted regression equation (Fig. 9.14) is
$$\hat{Y} = 2.6185 + 0.4369X$$

Fig. 9.14: Best fitted regression equation for the given set of (X, Y) values. (Chart: Monthly Sales with linear trend line y = 0.436x + 2.618.)

E7) Refer to Sec. 9.5.
E8) In Example 2, we have fitted the simple linear regression equation Y = a + bX + e for the given set of data and obtained the best fitted regression equation as
$$\hat{Y} = 2.6185 + 0.4369X$$
From the above equation of regression, we calculate the fitted values of the dependent variable Y, i.e., Ŷi for all i, and obtain the values of residuals, i.e., ri for all i, as given in the following table:

S. No.   Y Values   Ŷ Values   r_i      r_i²
1        1.8        4.36       −2.56    (−2.56)² = 6.5536
2        3.5        5.24       −1.74    (−1.74)² = 3.0276
3        5.8        5.24       +0.56    (+0.56)² = 0.3136
4        7.8        6.11       +1.69    (+1.69)² = 2.8561
5        8.7        6.99       +1.71    (+1.71)² = 2.9241
6        9.8        8.73       +1.07    (+1.07)² = 1.1449
7        10.7       10.48      +0.22    (+0.22)² = 0.0484
8        11.5       11.36      +0.14    (+0.14)² = 0.0196
9        12.9       12.23      +0.67    (+0.67)² = 0.4489
10       13.6       13.98      −0.38    (−0.38)² = 0.1444
11       14.2       14.85      −0.65    (−0.65)² = 0.4225
12       15.0       15.73      −0.73    (−0.73)² = 0.5329
Total    ΣY_i = 115.30   ΣŶ_i = 115.30   Σr_i = 0.00   Σr_i² = 18.4366

Introduction
In Machine Learning:
• Linear Regression is a supervised machine learning algorithm.
• It tries to find the best linear relationship that describes the data you have.
• It assumes that there exists a linear relationship between a dependent variable and independent variable(s).
• The value of the dependent variable of a linear regression model is a continuous value, i.e., a real number.

3. The positive and negative skewed normal probability plot shows the pattern of an upward trend (Fig. 9.9a) and downward trend (Fig. 9.9b) of the distribution, respectively, at the ends of the distribution.
(a) (b)
Fig. 9.9
Let us take up an example to explain the concepts of residual analysis.
Example 2: For the data given in Example 1, calculate the residuals and determine the standardised residuals for the model. Draw the corresponding residual plot and normal probability plot.
Solution: In Example 1, we have fitted the regression equation Y = a + b X for the given set of data. We have obtained the best fitted regression equation as
$$\hat{Y} = -52.6 + 9.656X$$
We now calculate the fitted value of the dependent variable from the best fitted regression line given above. We also calculate the values of residuals and their squares for all values of Y given in the data by substituting the values of X, and arrange them in a table as follows:

Y      Ŷ       r_i = Y_i − Ŷ_i   r_i²
1      5.34    −4.34             (−4.34)² = 18.83
6      6.30    −0.30             (−0.30)² = 0.09
8      7.26    +0.74             (+0.74)² = 0.55
10     8.23    +1.77             (+1.77)² = 3.13
11     10.16   +0.84             (+0.84)² = 0.70
20     15.95   +4.05             (+4.05)² = 16.40
21     20.78   +0.22             (+0.22)² = 0.48
22     22.72   −0.72             (−0.72)² = 0.52
23     24.65   −1.65             (−1.65)² = 2.72
25     25.61   −0.61             (−0.61)² = 0.37
Total: ΣŶ_i = 147.00, Σr_i = 0.00, Σr_i² = 43.79

Let us check how well the regression line fits the given data.

We calculate the standard deviation of residuals from equation (23) to calculate the standardised residuals for the given data as follows:
$$SS_{Res}^{1/2} = \sqrt{\frac{\sum r_i^2}{n - k}} = \sqrt{\frac{43.79}{8}} = \sqrt{5.47} = 2.34$$
Here n – k = 10 − 2 = 8.
We now construct a table which contains the values of residuals and the standardised residuals as follows:

S. No.   Residuals r_i   d_i = r_i / √SS_Res
1        −4.34           −4.34/2.34 = −1.855
2        −0.30           −0.30/2.34 = −0.128
3        +0.74           +0.74/2.34 = +0.316
4        +1.77           +1.77/2.34 = +0.756
5        +0.84           +0.84/2.34 = +0.359
6        +4.05           +4.05/2.34 = +1.730
7        +0.22           +0.22/2.34 = +0.094
8        −0.72           −0.72/2.34 = −0.307
9        −1.65           −1.65/2.34 = −0.705
10       −0.61           −0.61/2.34 = −0.260
Total    Σr_i = 0.00     Σd_i = 0.000

Note that the sum of the residuals is zero. We now plot the standardised residuals against the fitted values of Y and obtain a residual plot as shown in Fig. 9.10.

Fig. 9.10: Residual plot for the given data.

We follow the method explained in Sec. 9.5.3 to construct the normal probability plot. We arrange the standardised residuals in increasing order and rank them. After ranking the residuals, we calculate the percentile cumulative probability using equation (27) for the corresponding ranked residual values.

Introduction
• Regression model provides relationship between one dependent variable and explanatory variables.
• Use equation to set up the relationship:
  o Numerical Dependent (Response) Variable
  o 1 or More Numerical or Categorical Independent (Explanatory) Variables

To understand how well the regression line fits the given data, we first calculate the standard deviation of the residuals:
$$SS_{Res}^{1/2} = \sqrt{\frac{\sum r_i^2}{n - k}} = \sqrt{\frac{18.4366}{10}} = \sqrt{1.84366} = 1.3578$$
where n – k = 10.
We now construct the following table for obtaining the values of standardised residuals for all values of the residuals:

S. No.   Residuals r_i   Standardised Residuals d_i = r_i / √SS_Res
1        −2.56           −1.8854
2        −1.74           −1.28148
3        +0.56           +0.41243
4        +1.69           +1.24466
5        +1.71           +1.25939
6        +1.07           +0.78803
7        +0.22           +0.16202
8        +0.14           +0.10310
9        +0.67           +0.49344
10       −0.38           −0.27986
11       −0.65           −0.47871
12       −0.73           −0.53763
Total    Σr_i = 0.00     Σd_i = 0.00

The standardised residuals are obtained as above and it is observed that their total is zero. We plot these residuals against the fitted values of Y as shown in Fig. 9.15.

Fig. 9.15: Residual plot for the given data.

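The worked solution for the 12 monthly-sales points (fitting the line, computing residuals, standardising them and assigning percentile cumulative probabilities) can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library; the function and variable names are assumptions, and small differences from the tabulated values arise because the text rounds the fitted values to two decimals before computing residuals.

```python
import math

# Data from the solution of E6 (n = 12 pairs of X and Y).
X = [4, 6, 6, 8, 10, 14, 18, 20, 22, 26, 28, 30]
Y = [1.8, 3.5, 5.8, 7.8, 8.7, 9.8, 10.7, 11.5, 12.9, 13.6, 14.2, 15.0]


def fit_line(x, y):
    """Return (a_hat, b_hat) of the least squares line y = a + b x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b_hat = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a_hat = (sy - b_hat * sx) / n
    return a_hat, b_hat


def standardised_residuals(x, y, a_hat, b_hat, k=2):
    """Return (residuals, d) with d_i = r_i / sqrt(sum(r^2) / (n - k))."""
    residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]
    scale = math.sqrt(sum(r * r for r in residuals) / (len(x) - k))
    return residuals, [r / scale for r in residuals]


def plotting_positions(d):
    """Pair each ordered standardised residual with P_i = 100 (i - 1/2) / n."""
    n = len(d)
    return [(value, 100 * (i - 0.5) / n)
            for i, value in enumerate(sorted(d), start=1)]


a_hat, b_hat = fit_line(X, Y)
residuals, d = standardised_residuals(X, Y, a_hat, b_hat)
positions = plotting_positions(d)

print(f"Y_hat = {a_hat:.4f} + {b_hat:.4f} X")
for value, percentile in positions:
    print(f"{value:9.4f}  {percentile:8.4f}")
```

Plotting the second column of `positions` against the first gives the normal probability plot; plotting `d` against the fitted values gives the residual plot.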
Representing Linear Regression Model
A linear regression model represents the linear relationship between a dependent variable and independent variable(s) via a sloped straight line.
The sloped straight line that best fits the given data is called the regression line. It is also called the best-fit line.

Linear Regression
• A (simple) regression model that gives the straight-line relationship between two variables is called a linear regression model.

Y-intercept and slope of a line

Correlation vs. Regression
• A scatter diagram can be used to show the relationship between two variables
• Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
  o Correlation is only concerned with the strength of the relationship
  o No causal effect is implied with correlation

Types of Relationships
[Figure: scatter plots of Y against X showing linear relationships and curvilinear relationships]

Simple Linear Regression Model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where Y_i is the dependent variable, β0 the population Y intercept, β1 the population slope coefficient, X_i the independent variable and ε_i the random error term. β0 + β1 X_i is the linear component and ε_i is the random error component.

Regression Analysis
• Regression analysis is used to:
  o Predict the value of a dependent variable based on the value of at least one independent variable
  o Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to predict or explain
Independent variable: the variable used to explain the dependent variable

Types of Relationships
[Figure: scatter plots of Y against X showing strong relationships and weak relationships]

Simple Linear Regression Model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
[Figure: regression line with intercept β0 and slope β1, showing for a given X_i the observed value of Y, the predicted value of Y, and the random error ε_i for this X_i value]

Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X

Types of Relationships
[Figure: scatter plots of Y against X showing no relationship]

Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population regression line:
$$\hat{Y}_i = b_0 + b_1 X_i$$
where Ŷ_i is the estimated (or predicted) Y value for observation i, b0 the estimate of the regression intercept, b1 the estimate of the regression slope and X_i the value of X for observation i.
The individual random error terms e_i have a mean of zero

Types of Regression
• Based on the number of independent variables, there are two types of linear regression:
1. Simple Linear Regression
2. Multiple Linear Regression
[Diagram: Regression divides into Simple Regression (Simple Linear Regression, Simple Non-linear Regression) and Multiple Regression (Multiple Linear Regression, Multiple Non-Linear Regression)]

Plotting a linear equation

Estimate b0 and b1 from data (Method 1)
How to find b0 and b1 from the given data set?
• Least square regression line ŷ = b0 + b1x can be generated by finding b0 and b1 from the data
• b1 = [formula] and b0 = [formula]
• Estimation of b1 and b0 through this method is called the Method of Ordinary Least Squares (OLS)

Simple Linear Regression Analysis
• S1: Scatter Diagram
• S2: Find Least Square Line
• S3: Interpretation of a and b
• S4: Assumptions of the Regression Model

The Least Squares Line: Method 2
How to find b0 and b1 from the given data set?
• Least square regression line ŷ = b0 + b1x can be generated by finding b0 and b1 from the data
• b1 = [formula] and b0 = [formula], where
• SSxy = [formula] and SSxx = [formula]
• SS stands for “sum of squares”. The least square regression line is also called “Regression of y on x”

Coefficient of Determination, r²
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
• The coefficient of determination is also called r-squared and is denoted as r²
$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$
note: 0 ≤ r² ≤ 1

S1: Scatter Diagram
• A plot of paired observations is called a scatter diagram.

Standard Deviation of Random Errors

Least Squares Method
• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared differences between Y and Ŷ:
$$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum \big(Y_i - (b_0 + b_1 X_i)\big)^2$$
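As a hedged sketch of where the least squares estimates come from (standard calculus that is not shown on the slides; the shorthand SS_xy and SS_xx follows the "Method 2" slide, and the expanded sums are the usual textbook definitions, stated here as an assumption), setting the partial derivatives of the sum of squares to zero gives:

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \\
\frac{\partial S}{\partial b_0} &= -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) = 0 \\
\frac{\partial S}{\partial b_1} &= -2 \sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i) = 0 \\
\Rightarrow \quad b_1 &= \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{Y} - b_1 \bar{X} \\
SS_{xy} &= \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}, \qquad
SS_{xx} = \sum X_i^2 - \frac{(\sum X_i)^2}{n}
\end{align*}
```

The first condition says the residuals sum to zero, which is why the fitted line always passes through the point of means.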

Scatter Diagram and Straight Line

Interpretation of the Slope and the Intercept
• b0 is the estimated average value of Y when the value of X is zero
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X

Comparing Standard Errors
S_YX is a measure of the variation of observed Y values from the regression line
[Figure: two scatter plots of Y against X, one with small S_YX and one with large S_YX]
The magnitude of S_YX should always be judged relative to the size of the Y values in the sample data

Regression Line & Random Errors

Measures of Variation
• Total variation is made up of two parts:
$$SST = SSR + SSE$$
(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)
$$SST = \sum (Y_i - \bar{Y})^2 \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2 \qquad SSE = \sum (Y_i - \hat{Y}_i)^2$$
where:
Ȳ = Average value of the dependent variable
Y_i = Observed values of the dependent variable
Ŷ_i = Predicted value of Y for the given X_i value
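To see the decomposition SST = SSR + SSE and the coefficient of determination in numbers, here is a minimal sketch that uses the ten-house sample given later in these slides. The function name and layout are illustrative assumptions, not part of the slides.

```python
# House price sample from the slides: price in $1000s (Y), size in square feet (X).
PRICE = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
SQFT = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]


def variation_measures(x, y):
    """Return (SST, SSR, SSE) for the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
    return sst, ssr, sse


sst, ssr, sse = variation_measures(SQFT, PRICE)
r_squared = ssr / sst
print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSE={sse:.2f}  r^2={r_squared:.4f}")
```

For this sample roughly 58% of the variation in price is explained by size, so r² lies well inside the interval from 0 to 1.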

Error Sum of Squares

Measures of Variation
• SST = total sum of squares
  o Measures the variation of the Y_i values around their mean Ȳ
• SSR = regression sum of squares
  o Explained variation attributable to the relationship between X and Y
• SSE = error sum of squares
  o Variation attributable to factors other than the relationship between X and Y

Simple Linear Regression Example
• A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
• A random sample of 10 houses is selected
  o Dependent variable (Y) = house price in $1000s
  o Independent variable (X) = square feet
Sample Data for House Price Model

House Price in $1000s (Y)   Square Feet (X)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700

Interpretation of the Slope Coefficient, b1
house price = 98.24833 + 0.10977 (square feet)
• b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
  o Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000) = $109.77, on average, for each additional one square foot of size

Assumptions of Linear Regression Model

Graphical Presentation
• House price model: scatter plot
[Figure: scatter plot of House Price ($1000s) against Square Feet]

Predictions using Regression Analysis
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098 (2000)
            = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850

Standard Deviation of Random Errors
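The formula on the "Standard Deviation of Random Errors" slide did not survive extraction. Assuming it is the usual standard error of the estimate, S_YX = sqrt(SSE / (n − 2)), which is consistent with the later remark that n − 2 is used because two parameters are estimated, it can be computed for the house sample as follows. The function name is illustrative.

```python
import math

# House price sample from the slides: price in $1000s (Y), size in square feet (X).
PRICE = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
SQFT = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]


def standard_error_of_estimate(x, y):
    """Return S_YX = sqrt(SSE / (n - 2)) for the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))   # n - 2: two estimated parameters (b0, b1)


s_yx = standard_error_of_estimate(SQFT, PRICE)
print(f"S_YX = {s_yx:.2f} (in $1000s)")
```

A value of about 41 (that is, roughly $41,000) should be judged against house prices that range from about $200,000 to $400,000 in this sample.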

Assumptions of Linear Regression Model

Degree of Freedom?
• The number of independent pieces of information used to calculate the statistic is called the degrees of freedom.
• It is the number of values which are free to vary
• When the sample size is small, there are only a few independent pieces of information, and therefore only a few degrees of freedom.
• When the sample size is large, there are many independent pieces of information, and therefore many degrees of freedom.
• In the above equation, n − 2 is used because 2 degrees of freedom are lost due to the estimation of 2 parameters (b0 and b1)

Calculations for b0 and b1
• Calculate b1 using the formula discussed above
• Calculate b0 using the formula discussed above (it needs b1 also)
• ∑x = 17150, ∑y = 2865
• x̄ = 1715, ȳ = 286.5
• ∑(xy) = 5085975, ∑x ∑y = 49134750
• ∑(x²) = 30983750, (∑x)² = 294122500
• SSxy = ? SSxx = ?
• b1 = ? b0 = ?
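The quantities left as "?" on the calculations slide can be filled in from the listed sums. The sketch below assumes the usual definitions SSxy = ∑xy − (∑x)(∑y)/n and SSxx = ∑x² − (∑x)²/n, which the extracted slides name but do not show; the variable and function names are illustrative.

```python
# Summary sums listed on the slide (n = 10 houses).
n = 10
sum_x, sum_y = 17150, 2865
sum_xy = 5085975
sum_x2 = 30983750

ss_xy = sum_xy - sum_x * sum_y / n    # 172500.0
ss_xx = sum_x2 - sum_x ** 2 / n       # 1571500.0

b1 = ss_xy / ss_xx                    # slope
b0 = sum_y / n - b1 * sum_x / n       # intercept = y_bar - b1 * x_bar


def predict_price(square_feet):
    """Predicted house price in $1000s for a given size in square feet."""
    return b0 + b1 * square_feet


print(f"b1 = {b1:.5f}, b0 = {b0:.5f}")
print(f"predicted price for 2000 sq. ft. = {predict_price(2000):.2f}")
```

The unrounded coefficients give about 317.78 for a 2000 square foot house; the slides round the coefficients to 98.25 and 0.1098 first, which is why they report 317.85.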

Graphical Presentation
• House price model: scatter plot and regression line
[Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted line; slope = 0.10977, intercept = 98.248]
house price = 98.24833 + 0.10977 (square feet)

Assumptions of Linear Regression Model

Impact of Degree of Freedom?
Source: https://www.scribbr.com/statistics/degrees-of-freedom/

Interpretation of the Intercept, b0
house price = 98.24833 + 0.10977 (square feet)
• b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)
  o Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet

Assumptions of Linear Regression Model

Impact of Degree of Freedom?

Machine Learning (CSPC41)
Other Types of Regression

Types of Regression
• Linear Regression
• Logistic Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• Ridge Regression
• Lasso Regression

Support Vector Regression

Ridge Regression
• Ridge regression is a model tuning method used to analyse data that suffers from multicollinearity
• When multicollinearity occurs, the least-squares estimates are unbiased but their variances are large, which results in predicted values being far away from the actual values
• Ridge regression is a regularization technique, used to reduce the complexity of the model. It is also called L2 regularization
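To make the effect of regularization concrete, here is a minimal sketch for a single feature with an unpenalised intercept, where both ridge regression and the lasso listed above have closed-form slopes. The cost functions in the docstrings are the usual textbook forms and are an assumption here, since the equations on the slides did not survive extraction; `lam` stands for λ and the data are made up for illustration.

```python
def centred_sums(x, y):
    """Return (SS_xy, SS_xx) computed about the means of x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    return ss_xy, ss_xx


def ridge_slope(x, y, lam):
    """Slope minimising sum (y - b0 - b x)^2 + lam * b^2 (L2 penalty)."""
    ss_xy, ss_xx = centred_sums(x, y)
    return ss_xy / (ss_xx + lam)


def lasso_slope(x, y, lam):
    """Slope minimising sum (y - b0 - b x)^2 + lam * |b| (L1 penalty)."""
    ss_xy, ss_xx = centred_sums(x, y)
    shrunk = max(abs(ss_xy) - lam / 2.0, 0.0)   # soft-thresholding
    return (shrunk if ss_xy >= 0 else -shrunk) / ss_xx


x = [1, 2, 3, 4, 5]                  # hypothetical data
y = [1.2, 1.9, 3.2, 3.8, 5.1]
for lam in (0, 5, 50):
    print(lam, round(ridge_slope(x, y, lam), 4), round(lasso_slope(x, y, lam), 4))
```

With `lam = 0` both reduce to the ordinary least squares slope. As `lam` grows, the ridge slope shrinks towards zero without reaching it, while the lasso slope becomes exactly zero once `lam / 2` exceeds |SS_xy|.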

Logistic Regression
• Logistic regression is another supervised learning algorithm, used to solve classification problems.
• In classification problems, the dependent variable is in a binary or discrete format such as 0 or 1
• It works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or not spam, etc.
• It is a predictive analysis algorithm which works on the concept of probability
• It uses the sigmoid function (logistic function), which maps any real value to a probability between 0 and 1:
f(x) = 1 / (1 + e^(−x))

Ridge Regression
• Ridge regression is one of the most robust versions of linear regression, in which a small amount of bias is introduced so that we can get better long-term predictions.
• The amount of bias added to the model is known as the Ridge Regression penalty. We can compute this penalty term by multiplying lambda (a tunable parameter) by the squared weight of each individual feature.
• [Equation: ridge regression cost function with penalty parameter λ]
• Ridge regression is a regularization technique, used to reduce the complexity of the model. It is also called L2 regularization

Logistic Regression
• Uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
• There are three types of logistic regression:
  o Binary (0/1, pass/fail)
  o Multi (cats, dogs, lions)
  o Ordinal (low, medium, high)

Polynomial Regression
• A type of regression which models a non-linear dataset using a linear model.
• In polynomial regression, the original features are transformed into polynomial features of a given degree and then modelled using a linear model, which means the datapoints are best fitted using a polynomial curve.

Lasso Regression
• Another regularization technique to reduce the complexity of the model.
• LASSO stands for Least Absolute Shrinkage and Selection Operator.
• Similar to Ridge Regression, except that the penalty term contains only the absolute weights instead of the square of the weights.
• Since it takes absolute values, it can shrink the slope to 0, whereas Ridge Regression can only shrink it near to 0.
• It is also called L1 regularization. The equation for Lasso regression will be
[Equation: lasso regression cost function with penalty parameter λ]

Support Vector Regression
• SVM is a supervised learning algorithm which can be used for regression as well as classification problems.
• SVR is a regression algorithm which works for continuous variables. Some keywords used in Support Vector Regression: kernel, hyperplane, boundary line, support vectors
• (To be studied in detail during classification.) We always try to determine a hyperplane with a maximum margin, so that the maximum number of datapoints are covered in that margin. The main goal of SVR is to consider the maximum datapoints within the boundary lines, and the hyperplane (best-fit line) must contain a maximum number of datapoints

Classification
Ref, Source & Ack:
1. Jiawei Han, Micheline Kamber, and Jian Pei, University of Illinois at Urbana-Champaign & Simon Fraser University, ©Han, Kamber & Pei. All rights reserved.
2. Manaranjan Pradhan, U. Dinesh, Purvi Tiwari, book “Machine Learning using Python”,

Chapter 8. Classification: Basic Concepts
• Classification: Basic Concepts
• Binary Logistic Regression Classifier
• Decision Tree Induction
• Bayes Classification Methods
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods

Process (2): Using the Model in Prediction
Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Unseen Data: (Jeff, Professor, 4). Tenured?

Credit Classification (Cntd.)
• Reading and displaying few records from the dataset
Machine Learning using Python by Manaranjan Pradhan & Dinesh Kumar

Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Credit Classification (Cntd.)
• There are few columns which are categorical and have been inferred as objects
1. A11 : … < 0 DM
2. A12 : 0 <= … < 200 DM
3. A13 : … >= 200 DM/salary assignments for at least 1 year
4. A14 : no checking account
• Finding the number of observations in the dataset for good/bad credit

Binary Logistic Regression
• A statistical model in which the response variable takes only two values, 0 or 1.
• The explanatory variable can either be continuous or discrete
• Assume that the outcomes are called positive (Y=1) and negative (Y=0). The probability that a record belongs to a positive class, using the binary logistic regression model is given by
[Equation]
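The equation itself did not survive extraction. The standard form of the binary logistic regression model, given here as an assumption that is consistent with the sigmoid function and the logit description elsewhere in these slides (the symbols Z, β and m are illustrative), is:

```latex
\begin{align*}
P(Y = 1) &= \frac{e^{Z}}{1 + e^{Z}} = \frac{1}{1 + e^{-Z}}, \\
Z &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m
\end{align*}
```

Here X_1, …, X_m are the explanatory variables and the β coefficients are the parameters to be estimated.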

Prediction Problems: Classification vs. Numeric Prediction
• Classification
  – predicts categorical class labels (discrete or nominal)
  – classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
• Numeric Prediction
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
  – Credit/loan approval
  – Medical diagnosis: if a tumor is cancerous or benign
  – Fraud detection: if a transaction is fraudulent
  – Web page categorization: which category it is

Credit Classification (Cntd.)
• Creating features and storing all independent variables

Binary Logistic Regression (Cntd.)
• The logistic regression has an S-shaped curve
• Solving for Z, above equation can be written as
[Equation]
• The left hand side of the above equation is known as logit function; the right-hand side is a linear function
• Such models are called generalized linear models (GLM)
• The errors may not follow normal distribution
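The S-shaped curve and the logit function can be explored with a few lines of Python. This is a minimal sketch under the assumption that the logit is the log of odds, ln(p / (1 − p)), which is the standard definition; the function names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p):
    """Log of odds ln(p / (1 - p)) for 0 < p < 1; the inverse of sigmoid."""
    return math.log(p / (1.0 - p))


for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.4f}  logit(p)={logit(p):+.4f}")
```

The printed probabilities rise monotonically from near 0 to near 1, and applying `logit` to each probability recovers the linear value z, which is what makes the model linear on the log-odds scale.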

Classification: A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
  – Estimate accuracy of the model
    • The known label of the test sample is compared with the classified result from the model
    • Accuracy rate is the percentage of test set samples that are correctly classified by the model
    • Test set is independent of training set (otherwise overfitting)
  – If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set

Credit Classification
• Dataset: German credit rating dataset
• Available at the University of California, Irvine (UCI) Machine Learning Repository
• Used to predict whether a credit is a good or bad credit.
• Predicting the probability of default, when a customer applies for a loan in a bank.
• The data contains several attributes of the persons who availed the credit.

Credit Classification (Cntd.)
• Encoding Categorical Features

Process (1): Model Construction
Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no
The training data is passed to the classification algorithms, which produce the classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Credit Classification (Cntd.)

Credit Classification (Cntd.)
• For example, checkin_acc variable is encoded into three dummy variables and the base category.
1. checkin_acc_A12
2. checkin_acc_A13
3. checkin_acc_A14
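The dummy-variable encoding described for checkin_acc can be sketched without any library. The helper below is hypothetical and the sample values are made up; it only illustrates how four categories become three 0/1 columns, with A11 as the base category represented by all zeros.

```python
def dummy_encode(values, prefix, base_category):
    """One-hot encode `values`, dropping `base_category` so it maps to all zeros."""
    categories = sorted(set(values) - {base_category})
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


checkin_acc = ["A11", "A12", "A14", "A13", "A11"]   # hypothetical sample values
columns, rows = dummy_encode(checkin_acc, "checkin_acc", base_category="A11")

print(columns)
for value, row in zip(checkin_acc, rows):
    print(value, row)
```

In pandas the same result is typically obtained with `pandas.get_dummies(..., drop_first=True)`; whether the slides use that call is not visible in the extracted text.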
• Defining the method to get significant variables
• Predicting on Test Data
• Building logistic regression model
1. Set X (features) and Y (outcome) variables.
2. Add a new column and set its value to 1, to estimate the intercept.
3. Regression parameters are estimated using maximum likelihood estimation.
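The three model-building steps can be sketched without any library, assuming a single feature and a tiny made-up data set. The function fit_logit and the data are hypothetical; the slides themselves most likely rely on a statistics package, and this sketch swaps in a hand-written Newton-Raphson loop purely to show what maximum likelihood estimation does.

```python
from math import exp


def fit_logit(x, y, iterations=25):
    """Maximum likelihood fit of P(y = 1) = 1 / (1 + exp(-(b0*1 + b1*x)))."""
    const = [1.0] * len(x)          # step 2: constant column for the intercept
    b0 = b1 = 0.0
    for _ in range(iterations):     # step 3: Newton-Raphson on the log-likelihood
        p = [1 / (1 + exp(-(b0 * c + b1 * xi))) for c, xi in zip(const, x)]
        w = [pi * (1 - pi) for pi in p]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum((yi - pi) * c for yi, pi, c in zip(y, p, const))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Entries of the 2x2 information matrix X'WX.
        h00 = sum(wi * c * c for wi, c in zip(w, const))
        h01 = sum(wi * c * xi for wi, c, xi in zip(w, const, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Step 1: X (one hypothetical feature) and Y (1 = bad credit, 0 = good credit).
X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logit(X, Y)
print(b0, b1)
```

At the maximum likelihood estimate the gradient is zero, which is the condition the loop drives towards.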



• Splitting Dataset into Training and Test Sets
• Building Logistic Regression Model
• Building Logistic regression model using only significant variables
• Iterating through the predicted probability of each observation and tagging it as bad credit (1) if the probability value is more than 0.5, or as good credit (0) otherwise.
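Tagging observations as bad credit (1) or good credit (0) at a cut-off of 0.5 can be sketched as follows. The probabilities are made up and classify is a hypothetical helper name.

```python
def classify(probabilities, cutoff=0.5):
    """Tag each observation as bad credit (1) if its probability exceeds the cut-off."""
    return [1 if p > cutoff else 0 for p in probabilities]


# Hypothetical predicted probabilities of being a bad credit.
predicted_prob = [0.81, 0.12, 0.50, 0.67, 0.05]
predicted = classify(predicted_prob)
print(predicted)
```

Note that a probability of exactly 0.5 is tagged as good credit, because the rule is "more than 0.5".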



• Printing Model Summary
• Creating a Confusion Matrix
A matrix formed by cross-tabulating the actual values and the predicted values of the observations in the dataset.
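A confusion matrix can be built by counting the four combinations of actual and predicted labels. The sketch below assumes 1 = bad credit and 0 = good credit, puts actual classes on the rows and predicted classes on the columns with bad credit first, and uses made-up labels.

```python
def confusion_matrix(actual, predicted):
    """Return [[TP, FN], [FP, TN]] with bad credit (1) as the positive class."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return [[tp, fn], [fp, tn]]


# Hypothetical actual and predicted labels for eight observations.
actual = [1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 1, 1, 0, 0]
matrix = confusion_matrix(actual, predicted)
print(matrix)
```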



• Observations from the model output
1. A negative sign on a coefficient indicates that as the value of this variable increases, the probability of being a bad credit decreases.
2. The log-odds, and therefore the probability, of being a bad credit increases as duration, amount, and inst_rate increase.
3. The probability of being a bad credit decreases as age increases. This means older people are more likely than younger people to pay back their credits on time.



• Model Diagnostics
1. Wald’s test (Chi-square test) for checking the statistical significance of individual predictor variables.
2. Likelihood ratio test for checking the statistical significance of the overall model. It is used for variable (feature) selection.
3. Pseudo R2: It is a measure of goodness of the model. It does not have the same interpretation of R2 as in the MLR model.

• Observations from confusion matrix
1. Left-top quadrant represents actual bad credit and is correctly classified as bad credit. This is called True Positive (TP).
2. Left-down quadrant represents actual good credit and is incorrectly classified as bad credit. This is called False Positive (FP).
3. Right-top quadrant represents actual bad credit and is incorrectly classified as good credit. This is called False Negative (FN).
4. Right-down quadrant represents actual good credit and is correctly classified as good credit. This is called True Negative (TN).

• Creating the ROC curve
• Measuring Accuracies
1. Sensitivity or Recall - the conditional probability that the predicted class is positive given that the actual class is positive: Sensitivity = TP / (TP + FN).
2. Specificity - the conditional probability that the predicted class is negative given that the actual class is negative: Specificity = TN / (TN + FP).



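3. Precision - the conditional probability that the actual value is positive given that the prediction by the model is positive: Precision = TP / (TP + FP).

4. F-Score - a measure that combines precision and recall as their harmonic mean: F-Score = 2 × Precision × Recall / (Precision + Recall).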
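The four measures follow directly from the confusion-matrix counts. The sketch below uses made-up counts and a hypothetical helper name; the F-Score is computed as the harmonic mean of precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute sensitivity (recall), specificity, precision and F-score from counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts: 90 actual bad credits, 210 actual good credits.
metrics = classification_metrics(tp=30, fp=10, fn=60, tn=200)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the specificity is high while the sensitivity is low, the same pattern as a model that identifies good credits well but misses many bad credits.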



• Calculating measures in Python
• The model is very good at identifying the good credits, but not very good at identifying bad credits.
• This is the result for a cut-off probability of 0.5. It can be improved by choosing the right cut-off probability.

• Model with higher AUC is preferred, and AUC is frequently used for model selection.
• As a rule of thumb, an AUC of at least 0.7 is required for practical application of the model.

1. Cost-based Approach
• A measure that finds the cut-off where the total penalty cost is minimum.
• Assuming cost of a false positive is C1 and that of a false negative is C2
• Total cost will be
Total cost = FP * C1 + FN * C2
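The cost-based approach can be sketched by evaluating the total penalty cost on a grid of candidate cut-offs and keeping the cheapest one. The labels, probabilities, and grid are made up, the helper names are hypothetical, and a false negative is assumed to cost five times as much as a false positive.

```python
def total_cost(actual, probs, cutoff, cost_fp, cost_fn):
    """Total penalty cost FP * cost_fp + FN * cost_fn at a given cut-off."""
    fp = sum(a == 0 and p > cutoff for a, p in zip(actual, probs))
    fn = sum(a == 1 and p <= cutoff for a, p in zip(actual, probs))
    return fp * cost_fp + fn * cost_fn


def best_cutoff(actual, probs, cutoffs, cost_fp=1, cost_fn=5):
    """Return the candidate cut-off with the lowest total cost."""
    return min(cutoffs, key=lambda c: total_cost(actual, probs, c, cost_fp, cost_fn))


# Hypothetical labels (1 = bad credit) and predicted probabilities.
actual = [1, 0, 1, 0, 0, 1, 0, 0]
probs = [0.70, 0.20, 0.30, 0.40, 0.10, 0.55, 0.35, 0.05]
cutoffs = [0.1, 0.2, 0.3, 0.4, 0.5]
cutoff = best_cutoff(actual, probs, cutoffs)
print(cutoff, total_cost(actual, probs, cutoff, 1, 5))
```

Because false negatives are penalized more heavily, the cheapest cut-off in this toy example lies well below 0.5.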


• Distributions of predicted probabilities for bad/good credit
• Finding the optimal cut-off probability at lowest cost
• Finding Optimal Classification Cut-off
1. Youden’s Index (aka J-statistic)
• A classification cut-off probability for which the following function is maximized: J = sensitivity + specificity - 1
• Select the cut-off probability for which (sensitivity + specificity - 1) is maximum.
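Youden's index can be sketched by computing sensitivity + specificity - 1 at each candidate cut-off and taking the maximum. The data, the grid of cut-offs, and the helper name youden_j are assumptions made for illustration.

```python
def youden_j(actual, probs, cutoff):
    """Youden's J statistic (sensitivity + specificity - 1) at a given cut-off."""
    tp = sum(a == 1 and p > cutoff for a, p in zip(actual, probs))
    fn = sum(a == 1 and p <= cutoff for a, p in zip(actual, probs))
    tn = sum(a == 0 and p <= cutoff for a, p in zip(actual, probs))
    fp = sum(a == 0 and p > cutoff for a, p in zip(actual, probs))
    return tp / (tp + fn) + tn / (tn + fp) - 1


# Hypothetical labels (1 = bad credit) and predicted probabilities.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.70, 0.20, 0.30, 0.40, 0.10, 0.55, 0.35, 0.05]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
best = max(candidates, key=lambda c: youden_j(labels, scores, c))
print(best, youden_j(labels, scores, best))
```

When several cut-offs tie, max returns the first one in the candidate list.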



• Lowest cost is achieved at cut-off probability of 0.14 if false negatives are assumed to be five times costlier than false positives.

• Receiver Operating Characteristic (ROC) and Area Under the Curve (AUC)
1. ROC curve can be used to understand the overall performance of a logistic regression model and used for model selection.
2. ROC curve is a plot between sensitivity on the vertical axis and 1-specificity on the horizontal axis.
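The ROC curve and its AUC can be sketched by sweeping the cut-off over the observed probabilities, recording (1 - specificity, sensitivity) at each step, and integrating with the trapezoidal rule. The data are made up and the helper names are hypothetical; the sketch assumes both classes are present.

```python
def roc_points(actual, probs):
    """Return (1 - specificity, sensitivity) pairs, starting from (0, 0)."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(probs), reverse=True):
        tp = sum(a == 1 and p >= cutoff for a, p in zip(actual, probs))
        fp = sum(a == 0 and p >= cutoff for a, p in zip(actual, probs))
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels (1 = bad credit) and predicted probabilities.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.70, 0.20, 0.30, 0.40, 0.10, 0.55, 0.35, 0.05]
curve = roc_points(y_true, y_prob)
area = auc(curve)
print(round(area, 3))
```

The area equals the fraction of (bad, good) pairs in which the bad credit receives the higher predicted probability, here 13 out of 15.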

8.2 Decision Tree Induction 331 336 Chapter 8 Classification: Basic Concepts

Incremental versions of decision tree induction have also been proposed. When given new training data, these restructure the decision tree acquired from learning on previous training data, rather than relearning a new tree from scratch.

Differences in decision tree algorithms include how the attributes are selected in creating the tree (Section 8.2.2) and the mechanisms used for pruning (Section 8.2.3). The basic algorithm described earlier requires one pass over the training tuples in D for each level of the tree. This can lead to long training times and lack of available memory when dealing with large databases. Improvements regarding the scalability of decision tree induction are discussed in Section 8.2.4. Section 8.2.5 presents a visual interactive approach to decision tree construction. A discussion of strategies for extracting rules from decision trees is given in Section 8.4.2 regarding rule-based classification.

8.2.2 Attribute Selection Measures

An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes. If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a given partition would belong to the same class). Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario. Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.

The attribute selection measure provides a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure⁴ is chosen as the splitting attribute for the given tuples. If the splitting attribute is continuous-valued or if we are restricted to binary trees, then, respectively, either a split point or a splitting subset must also be determined as part of the splitting criterion. The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly. This section describes three popular attribute selection measures: information gain, gain ratio, and Gini index.

The notation used herein is as follows. Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label attribute has m distinct values defining m distinct classes, $C_i$ (for $i = 1, \ldots, m$). Let $C_{i,D}$ be the set of tuples of class $C_i$ in D. Let $|D|$ and $|C_{i,D}|$ denote the number of tuples in D and $C_{i,D}$, respectively.

Information Gain

ID3 uses information gain as its attribute selection measure. This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or “information content” of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the

⁴ Depending on the measure, either the highest or lowest score is chosen as the best (i.e., some measures strive to maximize while others strive to minimize).

[Figure 8.2: tree diagram. The root node tests age?, with branches youth, middle_aged, and senior. The youth branch tests student?, the middle_aged branch is a leaf labeled yes, and the senior branch tests credit_rating?.]

Figure 8.2 A decision tree for the concept buys_computer, indicating whether an AllElectronics customer is likely to purchase a computer. Each internal (nonleaf) node represents a test on an attribute. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).

likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce nonbinary trees.

“How are decision trees used for classification?” Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can easily be converted to classification rules.

“Why are decision tree classifiers so popular?” The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision trees can handle multidimensional data. Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans. The learning and classification steps of decision tree induction are simple and fast. In general, decision tree classifiers have good accuracy. However, successful use may depend on the data at hand. Decision tree induction algorithms have been used for classification in many application areas such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology. Decision trees are the basis of several commercial rule induction systems.

In Section 8.2.1, we describe a basic algorithm for learning decision trees. During tree construction, attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes. Popular measures of attribute selection are given in Section 8.2.2. When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data. Tree pruning is described in Section 8.2.3. Scalability issues for the induction of decision trees

from large databases are discussed in Section 8.2.4. Section 8.2.5 presents a visual mining approach to decision tree induction.

8.2.1 Decision Tree Induction

During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J. Marin, and P. T. Stone. Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to which newer supervised learning algorithms are often compared. In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the book Classification and Regression Trees (CART), which described the generation of binary decision trees. ID3 and CART were invented independently of one another at around the same time, yet follow a similar approach for learning decision trees from training tuples. These two cornerstone algorithms spawned a flurry of work on decision tree induction.

ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision trees are constructed in a top-down recursive divide-and-conquer manner. Most algorithms for decision tree induction also follow a top-down approach, which starts with a training set of tuples and their associated class labels. The training set is recursively partitioned into smaller subsets as the tree is being built. A basic decision tree algorithm is summarized in Figure 8.3. At first glance, the algorithm may appear long, but fear not! It is quite straightforward. The strategy is as follows.

The algorithm is called with three parameters: D, attribute_list, and Attribute_selection_method. We refer to D as a data partition. Initially, it is the complete set of training tuples and their associated class labels. The parameter attribute_list is a list of attributes describing the tuples. Attribute_selection_method specifies a heuristic procedure for selecting the attribute that “best” discriminates the given tuples according to class. This procedure employs an attribute selection measure such as information gain or the Gini index. Whether the tree is strictly binary is generally driven by the attribute selection measure. Some attribute selection measures, such as the Gini index, enforce the resulting tree to be binary. Others, like information gain, do not, therein allowing multiway splits (i.e., two or more branches to be grown from a node).

The tree starts as a single node, N, representing the training tuples in D (step 1).³

³ The partition of class-labeled training tuples at node N is the set of tuples that follow a path from the root of the tree to node N when being processed by the tree. This set is sometimes referred to in the literature as the family of tuples at node N. We have referred to this set as the “tuples represented at node N,” “the tuples that reach node N,” or simply “the tuples at node N.” Rather than storing the actual tuples at a node, most implementations store pointers to these tuples.

8 Classification: Basic Concepts

Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can build a classification model to categorize bank loan applications as either safe or risky. Such analysis can help provide us with a better understanding of the data at large. Many classification methods have been proposed by researchers in machine learning, pattern recognition, and statistics. Most algorithms are memory resident, typically assuming a small data size. Recent data mining research has built on such work, developing scalable classification and prediction techniques capable of handling large amounts of disk-resident data. Classification has numerous applications, including fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis.

We start off by introducing the main ideas of classification in Section 8.1. In the rest of this chapter, you will learn the basic techniques for data classification such as how to build decision tree classifiers (Section 8.2), Bayesian classifiers (Section 8.3), and rule-based classifiers (Section 8.4). Section 8.5 discusses how to evaluate and compare different classifiers. Various measures of accuracy are given as well as techniques for obtaining reliable accuracy estimates. Methods for increasing classifier accuracy are presented in Section 8.6, including cases for when the data set is class imbalanced (i.e., where the main class of interest is rare).

8.1 Basic Concepts

We introduce the concept of classification in Section 8.1.1. Section 8.1.2 describes the general approach to classification as a two-step process. In the first step, we build a classification model based on previous data. In the second step, we determine if the model’s accuracy is acceptable, and if so, we use the model to classify new data.

8.1.1 What Is Classification?

A bank loans officer needs analysis of her data to learn which loan applicants are “safe” and which are “risky” for the bank. A marketing manager at AllElectronics needs data

resulting partitions and reflects the least randomness or “impurity” in these partitions. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

The expected information needed to classify a tuple in D is given by

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i), \tag{8.1}$$

where $p_i$ is the nonzero probability that an arbitrary tuple in D belongs to class $C_i$ and is estimated by $|C_{i,D}|/|D|$. A log function to the base 2 is used, because the information is encoded in bits. Info(D) is just the average amount of information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is based solely on the proportions of tuples of each class. Info(D) is also known as the entropy of D.

Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, $\{a_1, a_2, \ldots, a_v\}$, as observed from the training data. If A is discrete-valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, $\{D_1, D_2, \ldots, D_v\}$, where $D_j$ contains those tuples in D that have outcome $a_j$ of A. These partitions would correspond to the branches grown from node N. Ideally, we would like this partitioning to produce an exact classification of the tuples. That is, we would like for each partition to be pure. However, it is quite likely that the partitions will be impure (e.g., where a partition may contain a collection of tuples from different classes rather than from a single class).

How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j). \tag{8.2}$$

The term $|D_j|/|D|$ acts as the weight of the jth partition. $\mathrm{Info}_A(D)$ is the expected information required to classify a tuple from D based on the partitioning by A. The smaller the expected information (still) required, the greater the purity of the partitions.

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D). \tag{8.3}$$

In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the tuples is minimal (i.e., minimum $\mathrm{Info}_A(D)$).
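Equations (8.1) to (8.3), together with the greedy top-down strategy of Section 8.2.1, can be sketched in a few lines of Python. This is a simplified illustration rather than the book's algorithm: it handles only discrete-valued attributes with multiway splits, uses information gain as the selection measure, and represents the tree as nested dictionaries. The function names are hypothetical, and the rows are the 14 training tuples of Table 8.1.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
ROWS = [  # Table 8.1; the last field is the class label buys_computer
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def info(rows):
    """Eq. (8.1): expected information (entropy) of a partition, in bits."""
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def split(rows, idx):
    """Partition rows by the observed values of attribute number idx."""
    parts = {}
    for r in rows:
        parts.setdefault(r[idx], []).append(r)
    return parts


def gain(rows, idx):
    """Eq. (8.3): Info(D) minus the weighted information of Eq. (8.2)."""
    parts = split(rows, idx).values()
    return info(rows) - sum(len(p) / len(rows) * info(p) for p in parts)


def generate_tree(rows, attr_idxs):
    classes = {r[-1] for r in rows}
    if len(classes) == 1:                    # all tuples share one class: leaf
        return classes.pop()
    if not attr_idxs:                        # no attributes left: majority vote
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(attr_idxs, key=lambda i: gain(rows, i))
    remaining = [i for i in attr_idxs if i != best]
    # Only observed values are split on, so no partition is ever empty.
    return {ATTRS[best]: {value: generate_tree(part, remaining)
                          for value, part in split(rows, best).items()}}


tree = generate_tree(ROWS, [0, 1, 2, 3])
print(round(info(ROWS), 3))
print({ATTRS[i]: round(gain(ROWS, i), 3) for i in range(4)})
print(tree)
```

Computed at full precision, Gain(age) comes out near 0.2467 bits; subtracting the rounded values 0.940 and 0.694 gives the 0.246 quoted in the text. Either way age has the highest gain and becomes the root split.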
Data Mining: Concepts and Techniques
© 2012 Elsevier Inc. All rights reserved.

328 Chapter 8 Classification: Basic Concepts 8.2 Decision Tree Induction 333 338 Chapter 8 Classification: Basic Concepts

analysis to help guess whether a customer with a given profile will buy a new computer. Algorithm: Generate decision tree. Generate a decision tree from the training tuples of Table 8.1 Class-Labeled Training Tuples from the AllElectronics Customer Database
data partition, D.
A medical researcher wants to analyze breast cancer data to predict which one of three RID age income student credit rating Class: buys computer
specific treatments a patient should receive. In each of these examples, the data analysis Input:
1 youth high no fair no
task is classification, where a model or classifier is constructed to predict class (categor-
Data partition, D, which is a set of training tuples and their associated class labels; 2 youth high no excellent no
ical) labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for the
attribute list, the set of candidate attributes; 3 middle aged high no fair yes
marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data.
4 senior medium no fair yes
These categories can be represented by discrete values, where the ordering among values Attribute selection method, a procedure to determine the splitting criterion that “best”
has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments partitions the data tuples into individual classes. This criterion consists of a 5 senior low yes fair yes
A, B, and C, where there is no ordering implied among this group of treatment regimes. splitting attribute and, possibly, either a split-point or splitting subset. 6 senior low yes excellent no
Suppose that the marketing manager wants to predict how much a given customer 7 middle aged low yes excellent yes
Output: A decision tree. 8 youth medium no fair no
will spend during a sale at AllElectronics. This data analysis task is an example of numeric
prediction, where the model constructed predicts a continuous-valued function, or Method: 9 youth low yes fair yes
ordered value, as opposed to a class label. This model is a predictor. Regression analysis (1) create a node N ; 10 senior medium yes fair yes
is a statistical methodology that is most often used for numeric prediction; hence the (2) if tuples in D are all of the same class, C, then 11 youth medium yes excellent yes
two terms tend to be used synonymously, although other methods for numeric predic- (3) return N as a leaf node labeled with the class C; 12 middle aged medium no excellent yes
tion exist. Classification and numeric prediction are the two major types of prediction (4) if attribute list is empty then 13 middle aged high yes fair yes
problems. This chapter focuses on classification. (5) return N as a leaf node labeled with the majority class in D; // majority voting 14 senior medium no excellent no
(6) apply Attribute selection method(D, attribute list) to find the “best” splitting criterion;
(7) label node N with splitting criterion;
8.1.2 General Approach to Classification (8) if splitting attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees Example 8.1 Induction of a decision tree using information gain. Table 8.1 presents a training set,
“How does classification work?” Data classification is a two-step process, consisting of a (9) attribute list ← attribute list − splitting attribute; // remove splitting attribute D, of class-labeled tuples randomly selected from the AllElectronics customer database.
learning step (where a classification model is constructed) and a classification step (where (10) for each outcome j of splitting criterion (The data are adapted from Quinlan [Qui86]. In this example, each attribute is discrete-
the model is used to predict class labels for given data). The process is shown for the // partition the tuples and grow subtrees for each partition
valued. Continuous-valued attributes have been generalized.) The class label attribute,
loan application data of Figure 8.1. (The data are simplified for illustrative purposes. (11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
buys computer, has two distinct values (namely, {yes, no}); therefore, there are two dis-
In reality, we may expect many more attributes to be considered.
(13) attach a leaf labeled with the majority class in D to node N ; tinct classes (i.e., m = 2). Let class C1 correspond to yes and class C2 correspond to no.
In the first step, a classifier is built describing a predetermined set of data classes or
(14) else attach the node returned by Generate decision tree(Dj , attribute list) to node N ; There are nine tuples of class yes and five tuples of class no. A (root) node N is created
concepts. This is the learning step (or training phase), where a classification algorithm
endfor for the tuples in D. To find the splitting criterion for these tuples, we must compute
builds the classifier by analyzing or “learning from” a training set made up of database
(15) return N ; the information gain of each attribute. We first use Eq. (8.1) to compute the expected
tuples and their associated class labels. A tuple, X, is represented by an n-dimensional
information needed to classify a tuple in D:
attribute vector, X = (x1 , x2 , . . . , xn ), depicting n measurements made on the tuple
from n database attributes, respectively, A1 , A2 , . . . , An .1 Each tuple, X, is assumed to
Figure 8.3 Basic algorithm for inducing a decision tree from training tuples.
   
9 9 5 5
belong to a predefined class as determined by another database attribute called the class Info(D) = − log2 − log2 = 0.940 bits.
label attribute. The class label attribute is discrete-valued and unordered. It is categor- 14 14 14 14
ical (or nominal) in that each value serves as a category or class. The individual tuples If the tuples in D are all of the same class, then node N becomes a leaf and is labeled
making up the training set are referred to as training tuples and are randomly sam- Next, we need to compute the expected information requirement for each attribute.
with that class (steps 2 and 3). Note that steps 4 and 5 are terminating conditions. All Let’s start with the attribute age. We need to look at the distribution of yes and no tuples
pled from the database under analysis. In the context of classification, data tuples can be terminating conditions are explained at the end of the algorithm.
referred to as samples, examples, instances, data points, or objects.2 for each category of age. For the age category “youth,” there are two yes tuples and three
Otherwise, the algorithm calls Attribute selection method to determine the splitting no tuples. For the category “middle aged,” there are four yes tuples and zero no tuples.
criterion. The splitting criterion tells us which attribute to test at node N by deter- For the category “senior,” there are three yes tuples and two no tuples. Using Eq. (8.2),
1 Each attribute represents a “feature” of X. Hence, the pattern recognition literature uses the term fea- mining the “best” way to separate or partition the tuples in D into individual classes the expected information needed to classify a tuple in D if the tuples are partitioned
ture vector rather than attribute vector. In our discussion, we use the term attribute vector, and in our (step 6). The splitting criterion also tells us which branches to grow from node N according to age is
notation, any variable representing a vector is shown in bold italic font; measurements depicting the
with respect to the outcomes of the chosen test. More specifically, the splitting cri-
vector are shown in italic font (e.g., X = (x1 , x2 , x3 )).
terion indicates the splitting attribute and may also indicate either a split-point or
 
2 In 5 2 2 3 3
the machine learning literature, training tuples are commonly referred to as training samples. Infoage (D) = × − log2 − log2
Throughout this text, we prefer to use the term tuples instead of samples. a splitting subset. The splitting criterion is determined so that, ideally, the resulting 14 5 5 5 5

8.1 Basic Concepts 329 334 Chapter 8 Classification: Basic Concepts 8.2 Decision Tree Induction 339

 
partitions at each branch are as “pure” as possible. A partition is pure if all the tuples 4 4 4
Classification algorithm + × − log2
in it belong to the same class. In other words, if we split up the tuples in D according 14 4 4
Training data to the mutually exclusive outcomes of the splitting criterion, we hope for the resulting 5

3 3 2 2

partitions to be as pure as possible. + × − log2 − log2
14 5 5 5 5
name age income loan_decision The node N is labeled with the splitting criterion, which serves as a test at the node
= 0.694 bits.
Sandy Jones youth low risky
(step 7). A branch is grown from node N for each of the outcomes of the splitting
Bill Lee youth low risky criterion. The tuples in D are partitioned accordingly (steps 10 to 11). There are three
Hence, the gain in information from such a partitioning would be
Caroline Fox middle_aged high safe possible scenarios, as illustrated in Figure 8.4. Let A be the splitting attribute. A has v
Rick Field middle_aged low risky distinct values, {a1 , a2 , . . . , av }, based on the training data.
Susan Lake senior low safe Classification rules Gain(age) = Info(D) − Infoage (D) = 0.940 − 0.694 = 0.246 bits.
Claire Phips senior medium safe 1. A is discrete-valued: In this case, the outcomes of the test at node N correspond
Joe Smith middle_aged high safe directly to the known values of A. A branch is created for each known value, Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits,
... ... ... ...
IF age  youth THEN loan_decision  risky aj , of A and labeled with that value (Figure 8.4a). Partition Dj is the subset and Gain(credit rating) = 0.048 bits. Because age has the highest information gain
IF income  high THEN loan_decision  safe of class-labeled tuples in D having value aj of A. Because all the tuples in a
IF age  middle_aged AND income  low
among the attributes, it is selected as the splitting attribute. Node N is labeled with age,
THEN loan_decision  risky and branches are grown for each of the attribute’s values. The tuples are then partitioned
... accordingly, as shown in Figure 8.5. Notice that the tuples falling into the partition for
Partitioning scenarios Examples
(a) age = middle aged all belong to the same class. Because they all belong to class “yes,”
A? color? income? a leaf should therefore be created at the end of this branch and labeled “yes.” The final
decision tree returned by the algorithm was shown earlier in Figure 8.2.
Classification rules
medium
oran

low
red

high
purpl
n

a1 a2 ... av
blue
gree

ge
e

(a)

A? income?

Test data New data


A ⱕ split_ point A ⬎ split_ point ⱕ 42,000 ⬎ 42,000
name age income loan_decision (John Henry, middle_aged, low)
Loan decision? (b)
Juan Bello senior low safe
Sylvia Crest middle_aged low risky A 僆 SA ? color 僆 {red, green}?
Anne Yee middle_aged high safe
... ... ... ...
yes no yes no
risky
(b)
(c)

Figure 8.1 The data classification process: (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is loan decision, and the learned model or classifier is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.

Figure 8.4 This figure shows three possibilities for partitioning tuples based on the splitting criterion, each with examples. Let A be the splitting attribute. (a) If A is discrete-valued, then one branch is grown for each known value of A. (b) If A is continuous-valued, then two branches are grown, corresponding to A ≤ split point and A > split point. (c) If A is discrete-valued and a binary tree must be produced, then the test is of the form A ∈ SA , where SA is the splitting subset for A.

Figure 8.5 The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age. The tuples are shown partitioned accordingly.
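
The gain figures quoted for age, income, student, and credit rating can be checked with a short sketch. It assumes the per-value class counts reported in the AVC-sets of Figure 8.8 (the full Table 8.1 is not needed for this), and the names `AVC_SETS`, `info`, `info_after_split`, and `gain` are illustrative only, not part of any library. The results should agree with the quoted values up to rounding in the last digit.

```python
from math import log2

# Class counts (buys_computer = yes, buys_computer = no) per attribute value,
# copied from the AVC-sets of Figure 8.8.
AVC_SETS = {
    "age": {"youth": (2, 3), "middle_aged": (4, 0), "senior": (3, 2)},
    "income": {"low": (3, 1), "medium": (4, 2), "high": (2, 2)},
    "student": {"yes": (6, 1), "no": (3, 4)},
    "credit_rating": {"fair": (6, 2), "excellent": (3, 3)},
}


def info(counts):
    """Info(D): entropy in bits of a tuple of class counts."""
    total = sum(counts)
    # Empty classes contribute nothing (0 * log2(0) is taken as 0).
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def info_after_split(partitions):
    """Info_A(D): size-weighted entropy of the partitions induced by A."""
    total = sum(sum(counts) for counts in partitions.values())
    return sum(sum(counts) / total * info(counts) for counts in partitions.values())


def gain(partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    # Summing the partitions column-wise recovers the class counts of D.
    class_totals = tuple(sum(column) for column in zip(*partitions.values()))
    return info(class_totals) - info_after_split(partitions)


gains = {attribute: gain(parts) for attribute, parts in AVC_SETS.items()}
best_attribute = max(gains, key=gains.get)
```

Here `best_attribute` is the attribute with the largest gain, which is the one chosen as the splitting attribute at the root.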

330 Chapter 8 Classification: Basic Concepts 8.2 Decision Tree Induction 335 340 Chapter 8 Classification: Basic Concepts

Because the class label of each training tuple is provided, this step is also known as given partition have the same value for A, A need not be considered in any future “But how can we compute the information gain of an attribute that is continuous-
supervised learning (i.e., the learning of the classifier is “supervised” in that it is told partitioning of the tuples. Therefore, it is removed from attribute list (steps 8 valued, unlike in the example?” Suppose, instead, that we have an attribute A that is
to which class each training tuple belongs). It contrasts with unsupervised learning (or and 9). continuous-valued, rather than discrete-valued. (For example, suppose that instead
clustering), in which the class label of each training tuple is not known, and the number 2. A is continuous-valued: In this case, the test at node N has two possible outcomes, of the discretized version of age from the example, we have the raw values for this
or set of classes to be learned may not be known in advance. For example, if we did not corresponding to the conditions A ≤ split point and A > split point, respectively, attribute.) For such a scenario, we must determine the “best” split-point for A, where
have the loan decision data available for the training set, we could use clustering to try to the split-point is a threshold on A.
where split point is the split-point returned by Attribute selection method as part
determine “groups of like tuples,” which may correspond to risk groups within the loan of the splitting criterion. (In practice, the split-point, a, is often taken as the We first sort the values of A in increasing order. Typically, the midpoint between each
application data. Clustering is the topic of Chapters 10 and 11. midpoint of two known adjacent values of A and therefore may not actually be pair of adjacent values is considered as a possible split-point. Therefore, given v values
This first step of the classification process can also be viewed as the learning of a map- a preexisting value of A from the training data.) Two branches are grown from of A, then v − 1 possible splits are evaluated. For example, the midpoint between the
ping or function, y = f (X), that can predict the associated class label y of a given tuple X. N and labeled according to the previous outcomes (Figure 8.4b). The tuples are values ai and ai+1 of A is
In this view, we wish to learn a mapping or function that separates the data classes. Typ- partitioned such that D1 holds the subset of class-labeled tuples in D for which
ically, this mapping is represented in the form of classification rules, decision trees, or
A ≤ split point, while D2 holds the rest.
mathematical formulae. In our example, the mapping is represented as classification
\[ \frac{a_i + a_{i+1}}{2}. \tag{8.4} \]
rules that identify loan applications as being either safe or risky (Figure 8.1a). The rules 3. A is discrete-valued and a binary tree must be produced (as dictated by the attribute
can be used to categorize future data tuples, as well as provide deeper insight into the selection measure or algorithm being used): The test at node N is of the form “A ∈ If the values of A are sorted in advance, then determining the best split for A requires
data contents. They also provide a compressed data representation. SA ?,” where SA is the splitting subset for A, returned by Attribute selection method only one pass through the values. For each possible split-point for A, we evaluate
“What about classification accuracy?” In the second step (Figure 8.1b), the model is as part of the splitting criterion. It is a subset of the known values of A. If a given InfoA (D), where the number of partitions is two, that is, v = 2 (or j = 1, 2) in Eq. (8.2).
used for classification. First, the predictive accuracy of the classifier is estimated. If we tuple has value aj of A and if aj ∈ SA , then the test at node N is satisfied. Two The point with the minimum expected information requirement for A is selected as the
were to use the training set to measure the classifier’s accuracy, this estimate would likely branches are grown from N (Figure 8.4c). By convention, the left branch out of N split point for A. D1 is the set of tuples in D satisfying A ≤ split point, and D2 is the set
be optimistic, because the classifier tends to overfit the data (i.e., during learning it may is labeled yes so that D1 corresponds to the subset of class-labeled tuples in D that of tuples in D satisfying A > split point.
incorporate some particular anomalies of the training data that are not present in the satisfy the test. The right branch out of N is labeled no so that D2 corresponds to
general data set overall). Therefore, a test set is used, made up of test tuples and their the subset of class-labeled tuples from D that do not satisfy the test.
Gain Ratio
associated class labels. They are independent of the training tuples, meaning that they The algorithm uses the same process recursively to form a decision tree for the tuples
were not used to construct the classifier. at each resulting partition, Dj , of D (step 14). The information gain measure is biased toward tests with many outcomes. That is, it
The accuracy of a classifier on a given test set is the percentage of test set tuples that prefers to select attributes having a large number of values. For example, consider an
are correctly classified by the classifier. The associated class label of each test tuple is com- The recursive partitioning stops only when any one of the following terminating attribute that acts as a unique identifier such as product ID. A split on product ID would
pared with the learned classifier’s class prediction for that tuple. Section 8.5 describes conditions is true: result in a large number of partitions (as many as there are values), each one containing
several methods for estimating classifier accuracy. If the accuracy of the classifier is con- 1. All the tuples in partition D (represented at node N ) belong to the same class just one tuple. Because each partition is pure, the information required to classify data
sidered acceptable, the classifier can be used to classify future data tuples for which the (steps 2 and 3). set D based on this partitioning would be Infoproduct ID (D) = 0. Therefore, the informa-
class label is not known. (Such data are also referred to in the machine learning liter- tion gained by partitioning on this attribute is maximal. Clearly, such a partitioning is
2. There are no remaining attributes on which the tuples may be further partitioned useless for classification.
ature as “unknown” or “previously unseen” data.) For example, the classification rules
(step 4). In this case, majority voting is employed (step 5). This involves con- C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
learned in Figure 8.1(a) from the analysis of data from previous loan applications can
verting node N into a leaf and labeling it with the most common class in D. which attempts to overcome this bias. It applies a kind of normalization to information
be used to approve or reject new or future loan applicants.
Alternatively, the class distribution of the node tuples may be stored. gain using a “split information” value defined analogously with Info(D) as
3. There are no tuples for a given branch, that is, a partition Dj is empty (step 12).
In this case, a leaf is created with the majority class in D (step 13).

The resulting decision tree is returned (step 15).

\[ SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right). \tag{8.5} \]

8.2 Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labeled training
The computational complexity of the algorithm given training set D is O(n × |D| ×
tuples. A decision tree is a flowchart-like tree structure, where each internal node (non- This value represents the potential information generated by splitting the training
log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the
leaf node) denotes a test on an attribute, each branch represents an outcome of the data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
number of training tuples in D. This means that the computational cost of growing a
test, and each leaf node (or terminal node) holds a class label. The topmost node in Note that, for each outcome, it considers the number of tuples having that outcome
tree grows at most n × |D| × log(|D|) with |D| tuples. The proof is left as an exercise for
a tree is the root node. A typical decision tree is shown in Figure 8.2. It represents with respect to the total number of tuples in D. It differs from information gain, which
the reader.
the concept buys computer, that is, it predicts whether a customer at AllElectronics is measures the information with respect to classification that is acquired based on the
8.2 Decision Tree Induction 341 346 Chapter 8 Classification: Basic Concepts 8.3 Bayes Classification Methods 351

same partitioning. The gain ratio is defined as Alternatively, prepruning and postpruning may be interleaved for a combined P(H|X) is the posterior probability, or a posteriori probability, of H conditioned
approach. Postpruning requires more computation than prepruning, yet generally leads on X. For example, suppose our world of data tuples is confined to customers described
to a more reliable tree. No single pruning method has been found to be superior over by the attributes age and income, respectively, and that X is a 35-year-old customer with
\[ GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}. \tag{8.6} \]
all others. Although some pruning methods do depend on the availability of additional an income of $40,000. Suppose that H is the hypothesis that our customer will buy a
data for pruning, this is usually not a concern when dealing with large databases. computer. Then P(H|X) reflects the probability that customer X will buy a computer
The attribute with the maximum gain ratio is selected as the splitting attribute. Note,
Although pruned trees tend to be more compact than their unpruned counterparts, given that we know the customer’s age and income.
however, that as the split information approaches 0, the ratio becomes unstable. A con-
they may still be rather large and complex. Decision trees can suffer from repetition In contrast, P(H) is the prior probability, or a priori probability, of H. For our exam-
straint is added to avoid this, whereby the information gain of the test selected must be
and replication (Figure 8.7), making them overwhelming to interpret. Repetition occurs ple, this is the probability that any given customer will buy a computer, regardless of age,
large—at least as great as the average gain over all tests examined.
when an attribute is repeatedly tested along a given branch of the tree (e.g., “age < 60?,” income, or any other information, for that matter. The posterior probability, P(H|X),
is based on more information (e.g., customer information) than the prior probability,
Example 8.2 Computation of gain ratio for the attribute income. A test on income splits the data of
P(H), which is independent of X.
Table 8.1 into three partitions, namely low, medium, and high, containing four, six, and A1 < 60? Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the
four tuples, respectively. To compute the gain ratio of income, we first use Eq. (8.5) to
yes no probability that a customer, X, is 35 years old and earns $40,000, given that we know the
obtain
… customer will buy a computer.
\[ SplitInfo_{income}(D) = -\frac{4}{14} \times \log_2\left(\frac{4}{14}\right) - \frac{6}{14} \times \log_2\left(\frac{6}{14}\right) - \frac{4}{14} \times \log_2\left(\frac{4}{14}\right) = 1.557. \]
A1 < 45? yes no … A1 < 50?
P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.
“How are these probabilities estimated?” P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see next. Bayes’ theorem is useful in that it provides
yes no a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X).
From Example 8.1, we have Gain(income) = 0.029. Therefore, GainRatio(income) = Bayes’ theorem is
class A class B
0.029/1.557 = 0.019.
(a)
\[ P(H|X) = \frac{P(X|H)P(H)}{P(X)}. \tag{8.10} \]
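
The gain ratio computation of Example 8.2 can be reproduced with a minimal sketch. It assumes only the partition sizes (four, six, and four tuples) and the value Gain(income) = 0.029 stated in the text; `split_info` and `gain_ratio` are hypothetical helper names chosen for illustration.

```python
from math import log2


def split_info(partition_sizes):
    """SplitInfo_A(D) of Eq. (8.5), from the partition sizes |D_j|."""
    total = sum(partition_sizes)
    return -sum(n / total * log2(n / total) for n in partition_sizes if n > 0)


def gain_ratio(gain, partition_sizes):
    """GainRatio(A) of Eq. (8.6): information gain normalized by split information."""
    denominator = split_info(partition_sizes)
    if denominator == 0:
        # A single non-empty partition carries no split information,
        # which is the unstable case the text warns about.
        raise ValueError("split information is 0; gain ratio is undefined")
    return gain / denominator


income_sizes = [4, 6, 4]  # low, medium, high
income_split_info = split_info(income_sizes)
income_gain_ratio = gain_ratio(0.029, income_sizes)
```

Raising an error when the split information is 0 is one possible design choice; the constraint described in the text (requiring the gain to be at least the average gain over all tests) is a different safeguard and is not implemented here.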
Gini Index
age = youth? Now that we have that out of the way, in the next section, we will look at how Bayes’
The Gini index is used in CART. Using the notation previously described, the Gini index yes no theorem is used in the naïve Bayesian classifier.
measures the impurity of D, a data partition or set of training tuples, as
\[ Gini(D) = 1 - \sum_{i=1}^{m} p_i^2, \tag{8.7} \]
student? credit_rating? yes no excellent fair
8.3.2 Naïve Bayesian Classification
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
class B credit_rating? income? class A
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple
where pi is the probability that a tuple in D belongs to class Ci and is estimated by excellent fair
low med high is represented by an n-dimensional attribute vector, X = (x1 , x2 , . . . , xn ), depicting n
|Ci,D |/|D|. The sum is computed over m classes.
income?
measurements made on the tuple from n attributes, respectively, A1 , A2 , . . . , An .
The Gini index considers a binary split for each attribute. Let’s first consider the case class A class A class B class C
where A is a discrete-valued attribute having v distinct values, {a1 , a2 , . . . , av }, occur- 2. Suppose that there are m classes, C1 , C2 , . . . , Cm . Given a tuple, X, the classifier will
ring in D. To determine the best binary split on A, we examine all the possible subsets low med high predict that X belongs to the class having the highest posterior probability, condi-
that can be formed using known values of A. Each subset, SA , can be considered as a class A class B class C tioned on X. That is, the naı̈ve Bayesian classifier predicts that tuple X belongs to the
binary test for attribute A of the form “A ∈ SA ?” Given a tuple, this test is satisfied if class Ci if and only if
the value of A for the tuple is among the values listed in SA . If A has v possible
P(Ci |X) > P(Cj |X) for 1 ≤ j ≤ m, j ≠ i.
values, then there are 2^v possible subsets. For example, if income has three possible values, (b)
namely {low, medium, high}, then the possible subsets are {low, medium, high}, {low, Thus, we maximize P(Ci |X). The class Ci for which P(Ci |X) is maximized is called
medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}. We exclude the the maximum posteriori hypothesis. By Bayes’ theorem (Eq. 8.10),
full set, {low, medium, high}, and the empty set from consideration since, conceptually, they do not represent a split. Therefore, there are 2^v − 2 possible ways to form two partitions of the data, D, based on a binary split on A.

Figure 8.7 An example of: (a) subtree repetition, where an attribute is repeatedly tested along a given branch of the tree (e.g., age) and (b) subtree replication, where duplicate subtrees exist within a tree (e.g., the subtree headed by the node “credit rating?”).

\[ P(C_i|X) = \frac{P(X|C_i)P(C_i)}{P(X)}. \tag{8.11} \]
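
The subset enumeration for a binary split on a discrete-valued attribute can be sketched as follows. This is an illustration under stated assumptions: it uses the income class counts from the AVC-sets of Figure 8.8, scores each candidate subset with the size-weighted Gini index of Eq. (8.8), and the names `INCOME_COUNTS`, `gini`, `class_counts`, and `binary_split_ginis` are hypothetical.

```python
from itertools import combinations

# Class counts (buys_computer = yes, buys_computer = no) per income value,
# taken from the AVC-sets of Figure 8.8.
INCOME_COUNTS = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}


def gini(counts):
    """Gini(D) of Eq. (8.7), from a tuple of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def class_counts(values, avc):
    """Class counts of the tuples whose attribute value lies in `values`."""
    return tuple(sum(column) for column in zip(*(avc[v] for v in values)))


def binary_split_ginis(avc):
    """Weighted Gini index for every proper, non-empty splitting subset."""
    values = sorted(avc)
    n = sum(sum(counts) for counts in avc.values())
    results = {}
    # Sizes 1 .. v-1 skip the empty set and the full set of values.
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            rest = tuple(v for v in values if v not in subset)
            d1 = class_counts(subset, avc)
            d2 = class_counts(rest, avc)
            results[subset] = sum(d1) / n * gini(d1) + sum(d2) / n * gini(d2)
    return results


income_ginis = binary_split_ginis(INCOME_COUNTS)
best_subset = min(income_ginis, key=income_ginis.get)
```

Each subset and its complement describe the same two partitions, so the 2^v − 2 candidate subsets correspond to (2^v − 2)/2 distinct splits; for income that is six subsets and three distinct splits, and a subset and its complement receive the same score.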


When considering a binary split, we compute a weighted sum of the impurity of each followed by “age < 45?,” and so on). In replication, duplicate subtrees exist within the 3. As P(X) is constant for all classes, only P(X|Ci )P(Ci ) needs to be maximized. If the
resulting partition. For example, if a binary split on A partitions D into D1 and D2 , the tree. These situations can impede the accuracy and comprehensibility of a decision tree. class prior probabilities are not known, then it is commonly assumed that the classes
Gini index of D given that partitioning is The use of multivariate splits (splits based on a combination of attributes) can prevent are equally likely, that is, P(C1 ) = P(C2 ) = · · · = P(Cm ), and we would therefore
these problems. Another approach is to use a different form of knowledge representa- maximize P(X|Ci ). Otherwise, we maximize P(X|Ci )P(Ci ). Note that the class prior
tion, such as rules, instead of decision trees. This is described in Section 8.4.2, which probabilities may be estimated by P(Ci ) = |Ci,D |/|D|, where |Ci,D | is the number of
shows how a rule-based classifier can be constructed by extracting IF-THEN rules from training tuples of class Ci in D.
\[ Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2). \tag{8.8} \]
a decision tree. 4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci ). To reduce computation in evaluating P(X|Ci ), the
For each attribute, each of the possible binary splits is considered. For a discrete-valued
8.2.4 Scalability and Decision Tree Induction
naïve assumption of class-conditional independence is made. This presumes that
attribute, the subset that gives the minimum Gini index for that attribute is selected as
the attributes’ values are conditionally independent of one another, given the class
its splitting subset. “What if D, the disk-resident training set of class-labeled tuples, does not fit in memory? In label of the tuple (i.e., that there are no dependence relationships among the
For continuous-valued attributes, each possible split-point must be considered. The other words, how scalable is decision tree induction?” The efficiency of existing decision attributes). Thus,
strategy is similar to that described earlier for information gain, where the midpoint tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively
between each pair of (sorted) adjacent values is taken as a possible split-point. The point
small data sets. Efficiency becomes an issue of concern when these algorithms are applied
giving the minimum Gini index for a given (continuous-valued) attribute is taken as
to the mining of very large real-world databases. The pioneering decision tree algorithms
the split-point of that attribute. Recall that for a possible split-point of A, D1 is the that we have discussed so far have the restriction that the training tuples should reside
set of tuples in D satisfying A ≤ split point, and D2 is the set of tuples in D satisfying in memory.
\[ P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i). \tag{8.12} \]
A > split point. In data mining applications, very large training sets of millions of tuples are com-
The reduction in impurity that would be incurred by a binary split on a discrete- or We can easily estimate the probabilities P(x1 |Ci ), P(x2 |Ci ), . . . , P(xn |Ci ) from the
mon. Most often, the training data will not fit in memory! Therefore, decision tree training tuples. Recall that here xk refers to the value of attribute Ak for tuple X. For
continuous-valued attribute A is construction becomes inefficient due to swapping of the training tuples in and out each attribute, we look at whether the attribute is categorical or continuous-valued.
of main and cache memories. More scalable approaches, capable of handling train- For instance, to compute P(X|Ci ), we consider the following:
\[ \Delta Gini(A) = Gini(D) - Gini_A(D). \tag{8.9} \]
ing data that are too large to fit in memory, are required. Earlier strategies to “save
space” included discretizing continuous-valued attributes and sampling data at each (a) If Ak is categorical, then P(xk |Ci ) is the number of tuples of class Ci in D having
node. These techniques, however, still assume that the training set can fit in memory. the value xk for Ak , divided by |Ci,D |, the number of tuples of class Ci in D.
The attribute that maximizes the reduction in impurity (or, equivalently, has the Several scalable decision tree induction methods have been introduced in recent stud-
minimum Gini index) is selected as the splitting attribute. This attribute and either (b) If Ak is continuous-valued, then we need to do a bit more work, but the cal-
ies. RainForest, for example, adapts to the amount of main memory available and applies culation is pretty straightforward. A continuous-valued attribute is typically
its splitting subset (for a discrete-valued splitting attribute) or split-point (for a to any decision tree induction algorithm. The method maintains an AVC-set (where
continuous-valued splitting attribute) together form the splitting criterion. assumed to have a Gaussian distribution with a mean µ and standard deviation
“AVC” stands for “Attribute-Value, Classlabel”) for each attribute, at each tree node, σ , defined by
describing the training tuples at the node. The AVC-set of an attribute A at node N
Example 8.3 Induction of a decision tree using the Gini index. Let D be the training data shown gives the class label counts for each value of A for the tuples at N . Figure 8.8 shows AVC-
earlier in Table 8.1, where there are nine tuples belonging to the class buys computer =
sets for the tuple data of Table 8.1. The set of all AVC-sets at a node N is the AVC-group
\[ g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \tag{8.13} \]
yes and the remaining five tuples belong to the class buys computer = no. A (root) node of N . The size of an AVC-set for attribute A at node N depends only on the number of
N is created for the tuples in D. We first use Eq. (8.7) for the Gini index to compute the so that
distinct values of A and the number of classes in the set of tuples at N . Typically, this size
impurity of D: should fit in memory, even for real-world data. RainForest also has techniques, however, for handling the case where the AVC-group does not fit in memory. Therefore, the
\[ P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}). \tag{8.14} \]
method has high scalability for decision tree induction in very large data sets. These equations may appear daunting, but hold on! We need to compute µCi
\[ Gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459. \]
BOAT (Bootstrapped Optimistic Algorithm for Tree construction) is a decision tree and σCi , which are the mean (i.e., average) and standard deviation, respectively,
of the values of attribute Ak for training tuples of class Ci . We then plug these two
algorithm that takes a completely different approach to scalability—it is not based on
the use of any special data structures. Instead, it uses a statistical technique known as quantities into Eq. (8.13), together with xk , to estimate P(xk |Ci ).
To find the splitting criterion for the tuples in D, we need to compute the Gini index “bootstrapping” (Section 8.5.4) to create several smaller samples (or subsets) of the For example, let X = (35, $40,000), where A1 and A2 are the attributes age and
for each attribute. Let’s start with the attribute income and consider each of the possible given training data, each of which fits in memory. Each subset is used to construct a income, respectively. Let the class label attribute be buys computer. The associated
splitting subsets. Consider the subset {low, medium}. This would result in 10 tuples in tree, resulting in several trees. The trees are examined and used to construct a new tree, class label for X is yes (i.e., buys computer = yes). Let’s suppose that age has not
partition D1 satisfying the condition “income ∈ {low, medium}.” The remaining four T ′ , that turns out to be “very close” to the tree that would have been generated if all the been discretized and therefore exists as a continuous-valued attribute. Suppose
tuples of D would be assigned to partition D2 . The Gini index value computed based on original training data had fit in memory. that from the training set, we find that customers in D who buy a computer are


this partitioning is

\[ Gini_{income \in \{low, medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2) = \frac{10}{14}\left(1 - \left(\frac{7}{10}\right)^2 - \left(\frac{3}{10}\right)^2\right) + \frac{4}{14}\left(1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right) = 0.443 = Gini_{income \in \{high\}}(D). \]

age            buys_computer = yes   buys_computer = no
youth          2                     3
middle_aged    4                     0
senior         3                     2

income         buys_computer = yes   buys_computer = no
low            3                     1
medium         4                     2
high           2                     2

student        buys_computer = yes   buys_computer = no
yes            6                     1
no             3                     4

credit_rating  buys_computer = yes   buys_computer = no
fair           6                     2
excellent      3                     3

38 ± 12 years of age. In other words, for attribute age and this class, we have µ = 38 years and σ = 12. We can plug these quantities, along with x1 = 35 for our tuple X, into Eq. (8.13) to estimate P(age = 35|buys computer = yes). For a quick review of mean and standard deviation calculations, please see Section 2.2.

5. To predict the class label of X, P(X|Ci )P(Ci ) is evaluated for each class Ci . The classifier predicts that the class label of tuple X is the class Ci if and only if

\[ P(X|C_i)P(C_i) > P(X|C_j)P(C_j) \quad \text{for } 1 \le j \le m,\ j \ne i. \tag{8.15} \]

In other words, the predicted class label is the class Ci for which P(X|Ci )P(Ci ) is the maximum.
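
Steps 3 to 5 of the naïve Bayesian classifier can be sketched in a few lines. The sketch assumes the class counts from the AVC-sets of Figure 8.8 and the tuple X = (youth, medium, yes, fair) used in Example 8.4; the Gaussian helper uses µ = 38 and σ = 12 from the age illustration. All names (`AVC_SETS`, `CLASS_COUNTS`, `categorical_likelihood`, `gaussian_likelihood`, `score`) are hypothetical, and no smoothing for zero counts is applied.

```python
from math import exp, pi, sqrt

CLASS_COUNTS = {"yes": 9, "no": 5}  # buys_computer class sizes in D

# Per-value class counts, copied from the AVC-sets of Figure 8.8.
AVC_SETS = {
    "age": {"youth": {"yes": 2, "no": 3},
            "middle_aged": {"yes": 4, "no": 0},
            "senior": {"yes": 3, "no": 2}},
    "income": {"low": {"yes": 3, "no": 1},
               "medium": {"yes": 4, "no": 2},
               "high": {"yes": 2, "no": 2}},
    "student": {"yes": {"yes": 6, "no": 1}, "no": {"yes": 3, "no": 4}},
    "credit_rating": {"fair": {"yes": 6, "no": 2},
                      "excellent": {"yes": 3, "no": 3}},
}


def categorical_likelihood(attribute, value, label):
    """P(x_k | C_i) for a categorical attribute: count ratio within the class."""
    return AVC_SETS[attribute][value][label] / CLASS_COUNTS[label]


def gaussian_likelihood(x, mean, std):
    """g(x, mu, sigma) of Eq. (8.13), for a continuous-valued attribute."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (sqrt(2 * pi) * std)


def score(x, label):
    """P(X | C_i) P(C_i) under class-conditional independence (Eq. 8.12)."""
    prior = CLASS_COUNTS[label] / sum(CLASS_COUNTS.values())
    likelihood = 1.0
    for attribute, value in x.items():
        likelihood *= categorical_likelihood(attribute, value, label)
    return likelihood * prior


X = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}
scores = {label: score(X, label) for label in CLASS_COUNTS}
predicted = max(scores, key=scores.get)  # class maximizing Eq. (8.15)

# Continuous case: P(age = 35 | buys_computer = yes) with mu = 38, sigma = 12.
age_35_given_yes = gaussian_likelihood(35, mean=38, std=12)
```

P(X) is never computed because it is the same for every class, so comparing the products P(X|Ci)P(Ci) is enough to pick the predicted label.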
“How effective are Bayesian classifiers?” Various empirical studies of this classifier in
Figure 8.8 The use of data structures to hold aggregate information regarding the training data (e.g., comparison to decision tree and neural network classifiers have found it to be com-
Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the sub- these AVC-sets describing Table 8.1’s data) are one approach to improving the scalability of parable in some domains. In theory, Bayesian classifiers have the minimum error rate
sets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}). decision tree induction. in comparison to all other classifiers. However, in practice this is not always the case,
Therefore, the best binary split for attribute income is on {low, medium} (or {high})
owing to inaccuracies in the assumptions made for its use, such as class-conditional
because it minimizes the Gini index. Evaluating age, we obtain {youth, senior} (or
independence, and the lack of available probability data.
{middle aged}) as the best split for age with a Gini index of 0.357; the attributes student BOAT can use any attribute selection measure that selects binary splits and that is Bayesian classifiers are also useful in that they provide a theoretical justification for
and credit rating are both binary, with Gini index values of 0.367 and 0.429, respectively. based on the notion of purity of partitions such as the Gini index. BOAT uses a lower other classifiers that do not explicitly use Bayes’ theorem. For example, under certain
The attribute age and splitting subset {youth, senior} therefore give the minimum bound on the attribute selection measure to detect if this “very good” tree, T ′ , is different assumptions, it can be shown that many neural network and curve-fitting algorithms
Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102. The binary from the “real” tree, T, that would have been generated using all of the data. It refines output the maximum posteriori hypothesis, as does the naïve Bayesian classifier.
split “age ∈ {youth, senior}?” results in the maximum reduction in impurity of the tuples T ′ to arrive at T.
in D and is returned as the splitting criterion. Node N is labeled with the criterion, two BOAT usually requires only two scans of D. This is quite an improvement, even Example 8.4 Predicting a class label using naïve Bayesian classification. We wish to predict the
branches are grown from it, and the tuples are partitioned accordingly. in comparison to traditional decision tree algorithms (e.g., the basic algorithm in class label of a tuple using naïve Bayesian classification, given the same training data
Figure 8.3), which require one scan per tree level! BOAT was found to be two to three as in Example 8.3 for decision tree induction. The training data were shown earlier
Other Attribute Selection Measures times faster than RainForest, while constructing exactly the same tree. An additional in Table 8.1. The data tuples are described by the attributes age, income, student, and
advantage of BOAT is that it can be used for incremental updates. That is, BOAT can credit rating. The class label attribute, buys computer, has two distinct values (namely,
This section on attribute selection measures was not intended to be exhaustive. We take new insertions and deletions for the training data and update the decision tree to {yes, no}). Let C1 correspond to the class buys computer = yes and C2 correspond to
have shown three measures that are commonly used for building decision trees. These reflect these changes, without having to reconstruct the tree from scratch. buys computer = no. The tuple we wish to classify is
measures are not without their biases. Information gain, as we saw, is biased toward
multivalued attributes. Although the gain ratio adjusts for this bias, it tends to prefer X = (age = youth, income = medium, student = yes, credit rating = fair)
unbalanced splits in which one partition is much smaller than the others. The Gini index
8.2.5 Visual Mining for Decision Tree Induction
We need to maximize P(X|Ci )P(Ci ), for i = 1, 2. P(Ci ), the prior probability of each
is biased toward multivalued attributes and has difficulty when the number of classes is “Are there any interactive approaches to decision tree induction that allow us to visual- class, can be computed based on the training tuples:
large. It also tends to favor tests that result in equal-size partitions and purity in both ize the data and the tree as it is being constructed? Can we use any knowledge of our
partitions. Although biased, these measures give reasonably good results in practice. data to help in building the tree?” In this section, you will learn about an approach to P(buys computer = yes) = 9/14 = 0.643
Many other attribute selection measures have been proposed. CHAID, a decision tree decision tree induction that supports these options. Perception-based classification P(buys computer = no) = 5/14 = 0.357
algorithm that is popular in marketing, uses an attribute selection measure that is based (PBC) is an interactive approach based on multidimensional visualization techniques
on the statistical χ² test for independence. Other measures include C-SEP (which performs and allows the user to incorporate background knowledge about the data when building To compute P(X|Ci ), for i = 1, 2, we compute the following conditional probabilities:
better than information gain and the Gini index in certain cases) and G-statistic a decision tree. By visually interacting with the data, the user is also likely to develop a
P(age = youth | buys computer = yes) = 2/9 = 0.222
(an information theoretic measure that is a close approximation to χ² distribution). deeper understanding of the data. The resulting trees tend to be smaller than those built
Attribute selection measures based on the Minimum Description Length (MDL) using traditional decision tree induction methods and so are easier to interpret, while P(age = youth | buys computer = no) = 3/5 = 0.600
principle have the least bias toward multivalued attributes. MDL-based measures use achieving about the same accuracy. P(income = medium | buys computer = yes) = 4/9 = 0.444
encoding techniques to define the “best” decision tree as the one that requires the fewest “How can the data be visualized to support interactive decision tree construction?” P(income = medium | buys computer = no) = 2/5 = 0.400
number of bits to both (1) encode the tree and (2) encode the exceptions to the tree PBC uses a pixel-oriented approach to view multidimensional data with its class label P(student = yes | buys computer = yes) = 6/9 = 0.667

344 Chapter 8 Classification: Basic Concepts 8.2 Decision Tree Induction 349 354 Chapter 8 Classification: Basic Concepts

(i.e., cases that are not correctly classified by the tree). Its main idea is that the simplest information. The circle segments approach is adapted, which maps d-dimensional data P(student = yes | buys computer = no) = 1/5 = 0.200
of solutions is preferred. objects to a circle that is partitioned into d segments, each representing one attribute P(credit rating = fair | buys computer = yes) = 6/9 = 0.667
Other attribute selection measures consider multivariate splits (i.e., where the par- (Section 2.3.1). Here, an attribute value of a data object is mapped to one colored pixel, P(credit rating = fair | buys computer = no) = 2/5 = 0.400
titioning of tuples is based on a combination of attributes, rather than on a single reflecting the object’s class label. This mapping is done for each attribute–value pair of
attribute). The CART system, for example, can find multivariate splits based on a lin- each data object. Sorting is done for each attribute to determine the arrangement order Using these probabilities, we obtain
ear combination of attributes. Multivariate splits are a form of attribute (or feature) within a segment. For example, attribute values within a given segment may be orga-
P(X|buys computer = yes) = P(age = youth | buys computer = yes)
construction, where new attributes are created based on the existing ones. (Attribute nized so as to display homogeneous (with respect to class label) regions within the same
× P(income = medium | buys computer = yes)
construction was also discussed in Chapter 3, as a form of data transformation.) These attribute value. The amount of training data that can be visualized at one time is approx-
other measures mentioned here are beyond the scope of this book. Additional references imately determined by the product of the number of attributes and the number of data × P(student = yes | buys computer = yes)
are given in the bibliographic notes at the end of this chapter (Section 8.9). objects. × P(credit rating = fair | buys computer = yes)
“Which attribute selection measure is the best?” All measures have some bias. It has The PBC system displays a split screen, consisting of a Data Interaction window and = 0.222 × 0.444 × 0.667 × 0.667 = 0.044.
been shown that the time complexity of decision tree induction generally increases a Knowledge Interaction window (Figure 8.9). The Data Interaction window displays
Similarly,
exponentially with tree height. Hence, measures that tend to produce shallower trees the circle segments of the data under examination, while the Knowledge Interaction
(e.g., with multiway rather than binary splits, and that favor more balanced splits) may window displays the decision tree constructed so far. Initially, the complete training set P(X|buys computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019.
be preferred. However, some studies have found that shallow trees tend to have a large is visualized in the Data Interaction window, while the Knowledge Interaction window
number of leaves and higher error rates. Despite several comparative studies, no one displays an empty decision tree. To find the class, Ci , that maximizes P(X|Ci )P(Ci ), we compute
attribute selection measure has been found to be significantly superior to others. Most Traditional decision tree algorithms allow only binary splits for numeric attributes. P(X|buys computer = yes)P(buys computer = yes) = 0.044 × 0.643 = 0.028
measures give quite good results. PBC, however, allows the user to specify multiple split-points, resulting in multiple P(X|buys computer = no)P(buys computer = no) = 0.019 × 0.357 = 0.007
branches to be grown from a single tree node.
Therefore, the naı̈ve Bayesian classifier predicts buys computer = yes for tuple X.
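
The calculation above can be reproduced in a few lines of Python. This is only a sketch: it takes the priors and conditional probabilities exactly as quoted in the example rather than counting them from the training tuples, and the names priors, likelihoods, and naive_bayes_scores are hypothetical, introduced here for illustration.

```python
from fractions import Fraction
from math import prod

# Priors P(Ci) and class-conditional probabilities P(xk | Ci), as quoted in the example.
priors = {"yes": Fraction(9, 14), "no": Fraction(5, 14)}
likelihoods = {
    "yes": {
        "age=youth": Fraction(2, 9),
        "income=medium": Fraction(4, 9),
        "student=yes": Fraction(6, 9),
        "credit_rating=fair": Fraction(6, 9),
    },
    "no": {
        "age=youth": Fraction(3, 5),
        "income=medium": Fraction(2, 5),
        "student=yes": Fraction(1, 5),
        "credit_rating=fair": Fraction(2, 5),
    },
}


def naive_bayes_scores(priors, likelihoods):
    """Return P(X|Ci) * P(Ci) per class, assuming class-conditional independence."""
    return {c: float(priors[c] * prod(likelihoods[c].values())) for c in priors}


scores = naive_bayes_scores(priors, likelihoods)
prediction = max(scores, key=scores.get)  # class with the largest P(X|Ci) * P(Ci)

for label, score in scores.items():
    print(f"buys_computer = {label}: {score:.3f}")
print("predicted class:", prediction)
```

Exact fractions are used so that no intermediate rounding occurs; the printed scores agree with the rounded values 0.028 and 0.007 given in the text.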
8.2.3 Tree Pruning

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least-reliable branches. An unpruned tree and a pruned version of it are shown in Figure 8.6. Pruned trees tend to be smaller and less complex and, thus, easier to comprehend. They are usually faster and better at correctly classifying independent test data (i.e., of previously unseen tuples) than unpruned trees.

“How does tree pruning work?” There are two common approaches to tree pruning: prepruning and postpruning.

In the prepruning approach, a tree is “pruned” by halting its construction early (e.g., by deciding not to further split or partition the subset of training tuples at a given node). Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset tuples or the probability distribution of those tuples.

When constructing a tree, measures such as statistical significance, information gain, Gini index, and so on, can be used to assess the goodness of a split. If partitioning the tuples at a node would result in a split that falls below a prespecified threshold, then further partitioning of the given subset is halted. There are difficulties, however, in choosing an appropriate threshold. High thresholds could result in oversimplified trees, whereas low thresholds could result in very little simplification.

Figure 8.6 An unpruned decision tree and a pruned version of it.

The second and more common approach is postpruning, which removes subtrees from a “fully grown” tree. A subtree at a given node is pruned by removing its branches and replacing it with a leaf. The leaf is labeled with the most frequent class among the subtree being replaced. For example, notice the subtree at node “A3?” in the unpruned tree of Figure 8.6. Suppose that the most common class within this subtree is “class B.”

Figure 8.9 A screenshot of PBC, a system for interactive decision tree construction. Multidimensional training data are viewed as circle segments in the Data Interaction window (left). The Knowledge Interaction window (right) displays the current decision tree. Source: From Ankerst, Elsen, Ester, and Kriegel [AEEK99].

A tree is interactively constructed as follows. The user visualizes the multidimensional data in the Data Interaction window and selects a splitting attribute and one or more split-points. The current decision tree in the Knowledge Interaction window is expanded. The user selects a node of the decision tree. The user may either assign a class label to the node (which makes the node a leaf) or request the visualization of the training data corresponding to the node. This leads to a new visualization of every attribute except the ones used for splitting criteria on the same path from the root. The interactive process continues until a class has been assigned to each leaf of the decision tree.

The trees constructed with PBC were compared with trees generated by the CART, C4.5, and SPRINT algorithms from various data sets. The trees created with PBC were of comparable accuracy with the tree from the algorithmic approaches, yet were significantly smaller and, thus, easier to understand. Users can use their domain knowledge in building a decision tree, but also gain a deeper understanding of their data during the construction process.

8.3 Bayes Classification Methods

“What if I encounter probability values of zero?” Recall that in Eq. (8.12), we estimate P(X|C_i) as the product of the probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i), based on the assumption of class-conditional independence. These probabilities can be estimated from the training tuples (step 4). We need to compute P(X|C_i) for each class (i = 1, 2, ..., m) to find the class C_i for which P(X|C_i)P(C_i) is the maximum (step 5). Let’s consider this calculation. For each attribute–value pair (i.e., A_k = x_k, for k = 1, 2, ..., n) in tuple X, we need to count the number of tuples having that attribute–value pair, per class (i.e., per C_i, for i = 1, ..., m). In Example 8.4, we have two classes (m = 2), namely buys_computer = yes and buys_computer = no. Therefore, for the attribute–value pair student = yes of X, say, we need two counts: the number of customers who are students and for which buys_computer = yes (which contributes to P(X|buys_computer = yes)) and the number of customers who are students and for which buys_computer = no (which contributes to P(X|buys_computer = no)).

But what if, say, there are no training tuples representing students for the class buys_computer = no, resulting in P(student = yes|buys_computer = no) = 0? In other words, what happens if we should end up with a probability value of zero for some P(x_k|C_i)? Plugging this zero value into Eq. (8.12) would return a zero probability for P(X|C_i), even though, without the zero probability, we may have ended up with a high probability, suggesting that X belonged to class C_i! A zero probability cancels the effects of all the other (posteriori) probabilities (on C_i) involved in the product.

There is a simple trick to avoid this problem. We can assume that our training database, D, is so large that adding one to each count that we need would only make a negligible difference in the estimated probability value, yet would conveniently avoid the case of probability values of zero. This technique for probability estimation is known as the Laplacian correction or Laplace estimator, named after Pierre Laplace, a French mathematician who lived from 1749 to 1827. If we have, say, q counts to which we each add one, then we must remember to add q to the corresponding denominator used in the probability calculation. We illustrate this technique in Example 8.5.

Example 8.5 Using the Laplacian correction to avoid computing probability values of zero. Suppose that for the class buys_computer = yes in some training database, D, containing 1000 tuples, we have 0 tuples with income = low, 990 tuples with income = medium, and 10 tuples with income = high. The probabilities of these events, without the Laplacian correction, are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000), respectively. Using the Laplacian correction for the three quantities, we pretend that we have 1 more tuple for each income-value pair. In this way, we instead obtain the following probabilities (rounded to three decimal places):

$$\frac{1}{1003} = 0.001, \quad \frac{991}{1003} = 0.988, \quad \text{and} \quad \frac{11}{1003} = 0.011,$$

respectively. The “corrected” probability estimates are close to their “uncorrected” counterparts, yet the zero probability value is avoided.
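
The correction in Example 8.5 can be sketched in Python as follows. The helper name laplace_estimates and the dictionary layout are assumptions made for this illustration; the counts are the ones quoted in the example.

```python
def laplace_estimates(counts):
    """Add one to each of the q counts and q to the shared denominator."""
    q = len(counts)                       # number of distinct attribute values
    total = sum(counts.values()) + q      # 1000 + 3 = 1003 in Example 8.5
    return {value: (n + 1) / total for value, n in counts.items()}


# Counts for the class buys_computer = yes, as quoted in Example 8.5.
income_counts = {"low": 0, "medium": 990, "high": 10}
corrected = laplace_estimates(income_counts)

for value, p in corrected.items():
    print(f"P(income = {value} | buys_computer = yes) = {p:.3f}")
```

Because q is added to the denominator, the corrected estimates still sum to one, and no estimate is zero.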
In the pruned version of the tree, the subtree in question is pruned by replacing it with the leaf “class B.”

The cost complexity pruning algorithm used in CART is an example of the postpruning approach. This approach considers the cost complexity of a tree to be a function of the number of leaves in the tree and the error rate of the tree (where the error rate is the percentage of tuples misclassified by the tree). It starts from the bottom of the tree. For each internal node, N, it computes the cost complexity of the subtree at N, and the cost complexity of the subtree at N if it were to be pruned (i.e., replaced by a leaf node). The two values are compared. If pruning the subtree at node N would result in a smaller cost complexity, then the subtree is pruned. Otherwise, it is kept.

A pruning set of class-labeled tuples is used to estimate cost complexity. This set is independent of the training set used to build the unpruned tree and of any test set used for accuracy estimation. The algorithm generates a set of progressively pruned trees. In general, the smallest decision tree that minimizes the cost complexity is preferred.

C4.5 uses a method called pessimistic pruning, which is similar to the cost complexity method in that it also uses error rate estimates to make decisions regarding subtree pruning. Pessimistic pruning, however, does not require the use of a prune set. Instead, it uses the training set to estimate error rates. Recall that an estimate of accuracy or error based on the training set is overly optimistic and, therefore, strongly biased. The pessimistic pruning method therefore adjusts the error rates obtained from the training set by adding a penalty, so as to counter the bias incurred.

Rather than pruning trees based on estimated error rates, we can prune trees based on the number of bits required to encode them. The “best” pruned tree is the one that minimizes the number of encoding bits. This method adopts the MDL principle, which was briefly introduced in Section 8.2.2. The basic idea is that the simplest solution is preferred. Unlike cost complexity pruning, it does not require an independent set of tuples.

“What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They can predict class membership probabilities such as the probability that a given tuple belongs to a particular class.

Bayesian classification is based on Bayes’ theorem, described next. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence. It is made to simplify the computations involved and, in this sense, is considered “naïve.”

Section 8.3.1 reviews basic probability notation and Bayes’ theorem. In Section 8.3.2 you will learn how to do naïve Bayesian classification.

8.3.1 Bayes’ Theorem

Bayes’ theorem is named after Thomas Bayes, a nonconformist English clergyman who did early work in probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.

8.4 Rule-Based Classification

In this section, we look at rule-based classifiers, where the learned model is represented as a set of IF-THEN rules. We first examine how such rules are used for classification (Section 8.4.1). We then study ways in which they can be generated, either from a decision tree (Section 8.4.2) or directly from the training data using a sequential covering algorithm (Section 8.4.3).

8.4.1 Using IF-THEN Rules for Classification

Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form

IF condition THEN conclusion.

An example is rule R1,

R1: IF age = youth AND student = yes THEN buys_computer = yes.

The “IF” part (or left side) of a rule is known as the rule antecedent or precondition. The “THEN” part (or right side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (e.g., age = youth and student = yes)

that are logically ANDed. The rule’s consequent contains a class prediction (in this case, we are predicting whether a customer will buy a computer). R1 can also be written as

R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes).

If the condition (i.e., all the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.

adopts a greedy depth-first strategy. Each time it is faced with adding a new attribute test (conjunct) to the current rule, it picks the one that most improves the rule quality, based on the training samples. We will say more about rule quality measures in a minute. For the moment, let’s say we use rule accuracy as our quality measure. Getting back to our example with Figure 8.11, suppose Learn_One_Rule finds that the attribute test income = high best improves the accuracy of our current (empty) rule. We append it to the condition, so that the current rule becomes

IF income = high THEN loan_decision = accept.

                        Predicted class
                        yes      no       Total
Actual class   yes      TP       FN       P
               no       FP       TN       N
               Total    P′       N′       P + N

Figure 8.14 Confusion matrix, shown with totals for positive and negative tuples.
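
As a rough illustration of the layout in Figure 8.14, the cells and totals of a two-class confusion matrix can be tallied from paired lists of actual and predicted labels. This is a sketch only: the function name confusion_counts and the toy label lists are hypothetical and do not come from the text.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Tally the cells and totals of a two-class confusion matrix."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1   # positive tuple labeled positive
            else:
                fn += 1   # positive tuple labeled negative
        else:
            if p == positive:
                fp += 1   # negative tuple labeled positive
            else:
                tn += 1   # negative tuple labeled negative
    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "P": tp + fn,     # actual positives
        "N": fp + tn,     # actual negatives
        "P'": tp + fp,    # tuples labeled positive
        "N'": tn + fn,    # tuples labeled negative
    }


# Toy labels, invented purely for illustration.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "yes", "no", "no", "yes"]
cm = confusion_counts(actual, predicted)
print(cm)
```

The row totals P and N depend only on the actual labels, while the column totals P′ and N′ depend only on what the classifier predicted.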
Each time we add an attribute test to a rule, the resulting rule should cover relatively more of the “accept” tuples. During the next iteration, we again consider the possible attribute tests and end up selecting credit_rating = excellent. Our current rule grows to become

IF income = high AND credit_rating = excellent THEN loan_decision = accept.

The process repeats, where at each step we continue to greedily grow rules until the resulting rule meets an acceptable quality level.

Greedy search does not allow for backtracking. At each step, we heuristically add what appears to be the best choice at the moment. What if we unknowingly made a poor choice along the way? To lessen the chance of this happening, instead of selecting the best attribute test to append to the current rule, we can select the best k attribute tests. In this way, we perform a beam search of width k, wherein we maintain the k best candidates overall at each step, rather than a single best candidate.

Rule Quality Measures

Learn_One_Rule needs a measure of rule quality. Every time it considers an attribute test, it must check to see if appending such a test to the current rule’s condition will result in an improved rule. Accuracy may seem like an obvious choice at first, but consider Example 8.8.

Example 8.8 Choosing between two rules based on accuracy. Consider the two rules as illustrated in Figure 8.12. Both are for the class loan_decision = accept. We use “a” to represent the tuples of class “accept” and “r” for the tuples of class “reject.” Rule R1 correctly classifies 38 of the 40 tuples it covers. Rule R2 covers only two tuples, which it correctly classifies. Their respective accuracies are 95% and 100%. Thus, R2 has greater accuracy than R1, but it is not the better rule because of its small coverage.

From this example, we see that accuracy on its own is not a reliable estimate of rule quality. Coverage on its own is not useful either; for a given class we could have a rule that covers many tuples, most of which belong to other classes! Thus, we seek other measures for evaluating rule quality, which may integrate aspects of accuracy and coverage. Here we will look at a few, namely entropy, another based on information gain, and a statistical test that considers coverage. For our discussion, suppose we are learning rules

Classes                  buys_computer = yes    buys_computer = no    Total     Recognition (%)
buys_computer = yes      6954                   46                    7000      99.34
buys_computer = no       412                    2588                  3000      86.27
Total                    7366                   2634                  10,000    95.42

Figure 8.15 Confusion matrix for the classes buys_computer = yes and buys_computer = no, where an entry in row i and column j shows the number of tuples of class i that were labeled by the classifier as class j. Ideally, the nondiagonal entries should be zero or close to zero.

mislabeling). Given m classes (where m ≥ 2), a confusion matrix is a table of at least size m by m. An entry, CM_{i,j}, in the first m rows and m columns indicates the number of tuples of class i that were labeled by the classifier as class j. For a classifier to have good accuracy, ideally most of the tuples would be represented along the diagonal of the confusion matrix, from entry CM_{1,1} to entry CM_{m,m}, with the rest of the entries being zero or close to zero. That is, ideally, FP and FN are around zero.

The table may have additional rows or columns to provide totals. For example, in the confusion matrix of Figure 8.14, P and N are shown. In addition, P′ is the number of tuples that were labeled as positive (TP + FP) and N′ is the number of tuples that were labeled as negative (TN + FN). The total number of tuples is TP + TN + FP + FN, or P + N, or P′ + N′. Note that although the confusion matrix shown is for a binary classification problem, confusion matrices can be easily drawn for multiple classes in a similar manner.

Now let’s look at the evaluation measures, starting with accuracy. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. That is,

$$\text{accuracy} = \frac{TP + TN}{P + N} \tag{8.21}$$

In the pattern recognition literature, this is also referred to as the overall recognition rate of the classifier, that is, it reflects how well the classifier recognizes tuples of the various classes. An example of a confusion matrix for the two classes buys_computer = yes (positive) and buys_computer = no (negative) is given in Figure 8.15. Totals are shown,

A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set, D, let n_covers be the number of tuples covered by R; n_correct be the number of tuples correctly classified by R; and |D| be the number of tuples in D. We can define the coverage and accuracy of R as

$$\text{coverage}(R) = \frac{n_{covers}}{|D|} \tag{8.16}$$

$$\text{accuracy}(R) = \frac{n_{correct}}{n_{covers}} \tag{8.17}$$

That is, a rule’s coverage is the percentage of tuples that are covered by the rule (i.e., their attribute values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples that it covers and see what percentage of them the rule can correctly classify.

Example 8.6 Rule accuracy and coverage. Let’s go back to our data in Table 8.1. These are class-labeled tuples from the AllElectronics customer database. Our task is to predict whether a customer will buy a computer. Consider rule R1, which covers 2 of the 14 tuples. It can correctly classify both tuples. Therefore, coverage(R1) = 2/14 = 14.29% and accuracy(R1) = 2/2 = 100%.

Let’s see how we can use rule-based classification to predict the class label of a given tuple, X. If a rule is satisfied by X, the rule is said to be triggered. For example, suppose we have

X = (age = youth, income = medium, student = yes, credit_rating = fair).

We would like to classify X according to buys_computer. X satisfies R1, which triggers the rule.

If R1 is the only rule satisfied, then the rule fires by returning the class prediction for X. Note that triggering does not always mean firing because there may be more than one rule that is satisfied! If more than one rule is triggered, we have a potential problem. What if they each specify a different class? Or what if no rule is satisfied by X?

We tackle the first question. If more than one rule is triggered, we need a conflict resolution strategy to figure out which rule gets to fire and assign its class prediction to X. There are many possible strategies. We look at two, namely size ordering and rule ordering.
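
The coverage and accuracy of rule R1 (Eqs. 8.16 and 8.17) and the notion of a triggered rule can be checked with a short Python sketch. The 14 tuples below are transcribed from Table 8.1 of the source text, which is not reproduced in this section, so treat them as an assumption; they are consistent with the class counts and conditional probabilities quoted in Example 8.4. The dictionary layout and the helper names covers and coverage_and_accuracy are hypothetical.

```python
COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
# Assumed transcription of Table 8.1 (AllElectronics customer tuples).
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
D = [dict(zip(COLUMNS, row)) for row in ROWS]

# R1: IF age = youth AND student = yes THEN buys_computer = yes
R1 = {
    "antecedent": {"age": "youth", "student": "yes"},
    "consequent": ("buys_computer", "yes"),
}


def covers(rule, tup):
    """True if every attribute test in the rule antecedent holds for the tuple."""
    return all(tup[attr] == value for attr, value in rule["antecedent"].items())


def coverage_and_accuracy(rule, data):
    """Return (n_covers / |D|, n_correct / n_covers) for the rule on the data."""
    covered = [t for t in data if covers(rule, t)]
    attr, label = rule["consequent"]
    n_correct = sum(1 for t in covered if t[attr] == label)
    coverage = len(covered) / len(data)
    accuracy = n_correct / len(covered) if covered else 0.0
    return coverage, accuracy


cov, acc = coverage_and_accuracy(R1, D)
print(f"coverage(R1) = {cov:.2%}, accuracy(R1) = {acc:.2%}")

X = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}
print("R1 triggered by X:", covers(R1, X))
```

Returning an accuracy of 0.0 for a rule that covers no tuples is a design choice of this sketch; the text does not define accuracy in that case.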


The size ordering scheme assigns the highest priority to the triggering rule that has the “toughest” requirements, where toughness is measured by the rule antecedent size. That is, the triggering rule with the most attribute tests is fired.

The rule ordering scheme prioritizes the rules beforehand. The ordering may be class-based or rule-based. With class-based ordering, the classes are sorted in order of decreasing “importance” such as by decreasing order of prevalence. That is, all the rules for the most prevalent (or most frequent) class come first, the rules for the next prevalent class come next, and so on. Alternatively, they may be sorted based on the misclassification cost per class. Within each class, the rules are not ordered; they don’t have to be because they all predict the same class (and so there can be no class conflict!).

With rule-based ordering, the rules are organized into one long priority list, according to some measure of rule quality, such as accuracy, coverage, or size (number of attribute tests in the rule antecedent), or based on advice from domain experts. When rule ordering is used, the rule set is known as a decision list. With rule ordering, the triggering rule that appears earliest in the list has the highest priority, and so it gets to fire its class prediction. Any other rule that satisfies X is ignored. Most rule-based classification systems use a class-based rule-ordering strategy.

Note that in the first strategy, overall the rules are unordered. They can be applied in any order when classifying a tuple. That is, a disjunction (logical OR) is implied between each of the rules. Each rule represents a standalone nugget or piece of knowledge. This is in contrast to the rule ordering (decision list) scheme for which rules must be applied in the prescribed order so as to avoid conflicts. Each rule in a decision list implies the negation of the rules that come before it in the list. Hence, rules in a decision list are more difficult to interpret.

Now that we have seen how we can handle conflicts, let’s go back to the scenario where there is no rule satisfied by X. How, then, can we determine the class label of X? In this case, a fallback or default rule can be set up to specify a default class, based on a training set. This may be the class in majority or the majority class of the tuples that were not covered by any rule. The default rule is evaluated at the end, if and only if no other rule covers X. The condition in the default rule is empty. In this way, the rule fires when no other rule is satisfied.

In the following sections, we examine how to build a rule-based classifier.

8.4.2 Rule Extraction from a Decision Tree

In Section 8.2, we learned how to build a decision tree classifier from a set of training data. Decision tree classifiers are a popular method of classification: it is easy to understand how decision trees work and they are known for their accuracy. Decision trees can become large and difficult to interpret. In this subsection, we look at how to build a rule-based classifier by extracting IF-THEN rules from a decision tree. In comparison with a decision tree, the IF-THEN rules may be easier for humans to understand, particularly if the decision tree is very large.

Figure 8.12 Rules for the class loan_decision = accept, showing accept (a) and reject (r) tuples.

for the class c. Our current rule is R: IF condition THEN class = c. We want to see if logically ANDing a given attribute test to condition would result in a better rule. We call the new condition, condition′, where R′: IF condition′ THEN class = c is our potential new rule. In other words, we want to see if R′ is any better than R.

We have already seen entropy in our discussion of the information gain measure used for attribute selection in decision tree induction (Section 8.2.2, Eq. 8.1). It is also known as the expected information needed to classify a tuple in data set, D. Here, D is the set of tuples covered by condition′ and p_i is the probability of class C_i in D. The lower the entropy, the better condition′ is. Entropy prefers conditions that cover a large number of tuples of a single class and few tuples of other classes.

Another measure is based on information gain and was proposed in FOIL (First Order Inductive Learner), a sequential covering algorithm that learns first-order logic rules. Learning first-order rules is more complex because such rules contain variables, whereas the rules we are concerned with in this section are propositional (i.e., variable-free).⁵ In machine learning, the tuples of the class for which we are learning rules are called positive tuples, while the remaining tuples are negative. Let pos (neg) be the number of positive (negative) tuples covered by R. Let pos′ (neg′) be the number of positive (negative) tuples covered by R′. FOIL assesses the information gained by extending condition′ as

$$\text{FOIL\_Gain} = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right) \tag{8.18}$$

It favors rules that have high accuracy and cover many positive tuples.

We can also use a statistical test of significance to determine if the apparent effect of a rule is not attributed to chance but instead indicates a genuine correlation between

as well as the recognition rates per class and overall. By glancing at a confusion matrix, it is easy to see if the corresponding classifier is confusing two classes.

For example, we see that it mislabeled 412 “no” tuples as “yes.” Accuracy is most effective when the class distribution is relatively balanced.

We can also speak of the error rate or misclassification rate of a classifier, M, which is simply 1 − accuracy(M), where accuracy(M) is the accuracy of M. This also can be computed as

$$\text{error rate} = \frac{FP + FN}{P + N} \tag{8.22}$$

If we were to use the training set (instead of a test set) to estimate the error rate of a model, this quantity is known as the resubstitution error. This estimate is optimistic relative to the true error rate (and similarly, the corresponding accuracy estimate is optimistic) because the model is not tested on any samples that it has not already seen.

We now consider the class imbalance problem, where the main class of interest is rare. That is, the data set distribution reflects a significant majority of the negative class and a minority positive class. For example, in fraud detection applications, the class of interest (or positive class) is “fraud,” which occurs much less frequently than the negative “nonfraudulent” class. In medical data, there may be a rare class, such as “cancer.” Suppose that you have trained a classifier to classify medical data tuples, where the class label attribute is “cancer” and the possible class values are “yes” and “no.” An accuracy rate of, say, 97% may make the classifier seem quite accurate, but what if only, say, 3% of the training tuples are actually cancer? Clearly, an accuracy rate of 97% may not be acceptable: the classifier could be correctly labeling only the noncancer tuples, for instance, and misclassifying all the cancer tuples. Instead, we need other measures, which assess how well the classifier can recognize the positive tuples (cancer = yes) and how well it can recognize the negative tuples (cancer = no).

The sensitivity and specificity measures can be used, respectively, for this purpose. Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion of negative tuples that are correctly identified). These measures are defined as

$$\text{sensitivity} = \frac{TP}{P} \tag{8.23}$$

$$\text{specificity} = \frac{TN}{N} \tag{8.24}$$

It can be shown that accuracy is a function of sensitivity and specificity:

$$\text{accuracy} = \text{sensitivity}\,\frac{P}{(P + N)} + \text{specificity}\,\frac{N}{(P + N)} \tag{8.25}$$
5 Incidentally, Example 8.9 Sensitivity and specificity. Figure 8.16 shows a confusion matrix for medical data
To extract rules from a decision tree, one rule is created for each path from the root FOIL was also proposed by Quinlan, the father of ID3.
to a leaf node. Each splitting criterion along a given path is logically ANDed to form the where the class values are yes and no for a class label attribute, cancer. The sensitivity
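The relationship in Eq. (8.25) between accuracy, sensitivity, and specificity can be checked numerically. The following is a minimal Python sketch; the function names and the counts are hypothetical and chosen only for illustration, not taken from the text.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate TP / P, where P = TP + FN (Eq. 8.23)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate TN / N, where N = TN + FP (Eq. 8.24)."""
    return tn / (tn + fp)


def accuracy_from_rates(sens: float, spec: float, p: int, n: int) -> float:
    """Accuracy as the class-weighted sum of the two rates (Eq. 8.25)."""
    return sens * p / (p + n) + spec * n / (p + n)


# Hypothetical counts, used only to exercise the formulas.
tp, fn, fp, tn = 40, 10, 5, 45
p, n = tp + fn, fp + tn

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
direct = (tp + tn) / (p + n)                      # (TP + TN) / (P + N)
weighted = accuracy_from_rates(sens, spec, p, n)  # Eq. (8.25)
print(sens, spec, direct, weighted)
```

Up to floating-point rounding, the two accuracy values agree, because sensitivity × P recovers TP and specificity × N recovers TN.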

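The FOIL_Gain measure of Eq. (8.18) is simple to compute once the coverage counts of a rule R and its extension R′ are known. The sketch below is illustrative only: the function name, the argument names, and the coverage counts are assumptions, not part of FOIL itself.

```python
import math


def foil_gain(pos: int, neg: int, pos_new: int, neg_new: int) -> float:
    """FOIL_Gain for extending R (covers pos/neg) to R' (covers pos_new/neg_new)."""
    if pos == 0 or pos_new == 0:
        raise ValueError("both rules must cover at least one positive tuple")
    before = math.log2(pos / (pos + neg))
    after = math.log2(pos_new / (pos_new + neg_new))
    return pos_new * (after - before)


# Hypothetical coverage: R covers 8 positive and 8 negative tuples;
# the extended rule R' covers 6 positive and 2 negative tuples.
gain = foil_gain(pos=8, neg=8, pos_new=6, neg_new=2)
print(round(gain, 3))
```

The gain is positive here because the extended rule is more accurate (6 of 8 covered tuples are positive, against 8 of 16) while still covering most of the positive tuples.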

rule antecedent (“IF” part). The leaf node holds the class prediction, forming the rule consequent (“THEN” part).

Example 8.7 Extracting classification rules from a decision tree. The decision tree of Figure 8.2 can be converted to classification IF-THEN rules by tracing the path from the root node to each leaf node in the tree. The rules extracted from Figure 8.2 are as follows:

R1: IF age = youth AND student = no THEN buys_computer = no
R2: IF age = youth AND student = yes THEN buys_computer = yes
R3: IF age = middle_aged THEN buys_computer = yes
R4: IF age = senior AND credit_rating = excellent THEN buys_computer = yes
R5: IF age = senior AND credit_rating = fair THEN buys_computer = no

A disjunction (logical OR) is implied between each of the extracted rules. Because the rules are extracted directly from the tree, they are mutually exclusive and exhaustive. Mutually exclusive means that we cannot have rule conflicts here because no two rules will be triggered for the same tuple. (We have one rule per leaf, and any tuple can map to only one leaf.) Exhaustive means there is one rule for each possible attribute–value combination, so that this set of rules does not require a default rule. Therefore, the order of the rules does not matter; they are unordered.
Since we end up with one rule per leaf, the set of extracted rules is not much simpler than the corresponding decision tree! The extracted rules may be even more difficult to interpret than the original trees in some cases. As an example, Figure 8.7 showed decision trees that suffer from subtree repetition and replication. The resulting set of rules extracted can be large and difficult to follow, because some of the attribute tests may be irrelevant or redundant. So, the plot thickens. Although it is easy to extract rules from a decision tree, we may need to do some more work by pruning the resulting rule set.
“How can we prune the rule set?” For a given rule antecedent, any condition that does not improve the estimated accuracy of the rule can be pruned (i.e., removed), thereby generalizing the rule. C4.5 extracts rules from an unpruned tree, and then prunes the rules using a pessimistic approach similar to its tree pruning method. The training tuples and their associated class labels are used to estimate rule accuracy. However, because this would result in an optimistic estimate, the estimate is adjusted to compensate for the bias, resulting in a pessimistic estimate. In addition, any rule that does not contribute to the overall accuracy of the entire rule set can also be pruned.
Other problems arise during rule pruning, however, as the rules will no longer be mutually exclusive and exhaustive. For conflict resolution, C4.5 adopts a class-based ordering scheme. It groups together all rules for a single class, and then determines a ranking of these class rule sets. Within a rule set, the rules are not ordered. C4.5 orders the class rule sets so as to minimize the number of false-positive errors (i.e., where a rule predicts a class, C, but the actual class is not C). The class rule set with the fewest false positives is examined first. Once pruning is complete, a final check is

attribute values and classes. The test compares the observed distribution among classes of tuples covered by a rule with the expected distribution that would result if the rule made predictions at random. We want to assess whether any observed differences between these two distributions may be attributed to chance. We can use the likelihood ratio statistic,

$$\text{Likelihood\_Ratio} = 2 \sum_{i=1}^{m} f_i \log\left(\frac{f_i}{e_i}\right), \tag{8.19}$$

where m is the number of classes.
For tuples satisfying the rule, f_i is the observed frequency of each class i among the tuples. e_i is what we would expect the frequency of each class i to be if the rule made random predictions. The statistic has a χ² distribution with m − 1 degrees of freedom. The higher the likelihood ratio, the more likely that there is a significant difference in the number of correct predictions made by our rule in comparison with a “random guesser.” That is, the performance of our rule is not due to chance. The ratio helps identify rules with insignificant coverage.
CN2 uses entropy together with the likelihood ratio test, while FOIL’s information gain is used by RIPPER.

Rule Pruning

Learn_One_Rule does not employ a test set when evaluating rules. Assessments of rule quality as described previously are made with tuples from the original training data. These assessments are optimistic because the rules will likely overfit the data. That is, the rules may perform well on the training data, but less well on subsequent data. To compensate for this, we can prune the rules. A rule is pruned by removing a conjunct (attribute test). We choose to prune a rule, R, if the pruned version of R has greater quality, as assessed on an independent set of tuples. As in decision tree pruning, we refer to this set as a pruning set. Various pruning strategies can be used, such as the pessimistic pruning approach described in the previous section.
FOIL uses a simple yet effective method. Given a rule, R,

$$\text{FOIL\_Prune}(R) = \frac{pos - neg}{pos + neg}, \tag{8.20}$$

where pos and neg are the number of positive and negative tuples covered by R, respectively. This value will increase with the accuracy of R on a pruning set. Therefore, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.
By convention, RIPPER starts with the most recently added conjunct when considering pruning. Conjuncts are pruned one at a time as long as this results in an improvement.

Classes   yes    no      Total    Recognition (%)
yes       90     210     300      30.00
no        140    9560    9700     98.56
Total     230    9770    10,000   96.50

Figure 8.16 Confusion matrix for the classes cancer = yes and cancer = no.

of the classifier is 90/300 = 30.00%. The specificity is 9560/9700 = 98.56%. The classifier’s overall accuracy is 9650/10,000 = 96.50%. Thus, we note that although the classifier has a high accuracy, its ability to correctly label the positive (rare) class is poor given its low sensitivity. It has high specificity, meaning that it can accurately recognize negative tuples. Techniques for handling class-imbalanced data are given in Section 8.6.5.
The precision and recall measures are also widely used in classification. Precision can be thought of as a measure of exactness (i.e., what percentage of tuples labeled as positive are actually such), whereas recall is a measure of completeness (what percentage of positive tuples are labeled as such). If recall seems familiar, that’s because it is the same as sensitivity (or the true positive rate). These measures can be computed as

$$\text{precision} = \frac{TP}{TP + FP} \tag{8.26}$$

$$\text{recall} = \frac{TP}{TP + FN} = \frac{TP}{P}. \tag{8.27}$$

Example 8.10 Precision and recall. The precision of the classifier in Figure 8.16 for the yes class is 90/230 = 39.13%. The recall is 90/300 = 30.00%, which is the same calculation as for sensitivity in Example 8.9.

A perfect precision score of 1.0 for a class C means that every tuple that the classifier labeled as belonging to class C does indeed belong to class C. However, it does not tell us anything about the number of class C tuples that the classifier mislabeled. A perfect recall score of 1.0 for C means that every item from class C was labeled as such, but it does not tell us how many other tuples were incorrectly labeled as belonging to class C. There tends to be an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. For example, our medical classifier may achieve high precision by labeling all cancer tuples that present a certain way as cancer, but may have low recall if it mislabels many other instances of cancer tuples. Precision and recall scores are typically used together, where precision values are compared for a fixed value of recall, or vice versa. For example, we may compare precision values at a recall value of, say, 0.75.
An alternative way to use precision and recall is to combine them into a single measure. This is the approach of the F measure (also known as the F_1 score or F-score) and
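The figures quoted in Examples 8.9 and 8.10 can be reproduced directly from the four cells of the confusion matrix in Figure 8.16. The following is a small Python sketch; the variable names are illustrative, and the rows of the matrix are read as actual classes and the columns as predicted classes.

```python
# Counts read from Figure 8.16 (positive class: cancer = yes).
TP, FN = 90, 210      # actual yes: predicted yes / predicted no
FP, TN = 140, 9560    # actual no:  predicted yes / predicted no

precision = TP / (TP + FP)                    # Eq. (8.26)
recall = TP / (TP + FN)                       # Eq. (8.27), same as sensitivity
specificity = TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)

for name, value in [("precision", precision), ("recall", recall),
                    ("specificity", specificity), ("accuracy", accuracy)]:
    print(f"{name}: {value:.2%}")
```

The gap between the high accuracy and the low recall is the point of the example: with only 300 positive tuples out of 10,000, accuracy is dominated by the negative class.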


done to remove any duplicates. When choosing a default class, C4.5 does not choose the majority class, because this class will likely have many rules for its tuples. Instead, it selects the class that contains the most training tuples that were not covered by any rule.

8.4.3 Rule Induction Using a Sequential Covering Algorithm

IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using a sequential covering algorithm. The name comes from the notion that the rules are learned sequentially (one at a time), where each rule for a given class will ideally cover many of the class’s tuples (and hopefully none of the tuples of other classes). Sequential covering algorithms are the most widely used approach to mining disjunctive sets of classification rules, and form the topic of this subsection.
There are many sequential covering algorithms. Popular variations include AQ, CN2, and the more recent RIPPER. The general strategy is as follows. Rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples. This sequential learning of rules is in contrast to decision tree induction. Because the path to each leaf in a decision tree corresponds to a rule, we can consider decision tree induction as learning a set of rules simultaneously.
A basic sequential covering algorithm is shown in Figure 8.10. Here, rules are learned for one class at a time. Ideally, when learning a rule for a class, C, we would like the rule to cover all (or many) of the training tuples of class C and none (or few) of the tuples

Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.

Input:
    D, a data set of class-labeled tuples;
    Att_vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:
(1) Rule_set = {}; // initial set of rules learned is empty
(2) for each class c do
(3)     repeat
(4)         Rule = Learn_One_Rule(D, Att_vals, c);
(5)         remove tuples covered by Rule from D;
(6)         Rule_set = Rule_set + Rule; // add new rule to rule set
(7)     until terminating condition;
(8) endfor
(9) return Rule_set;

Figure 8.10 Basic sequential covering algorithm.

8.5 Model Evaluation and Selection

Now that you may have built a classification model, there may be many questions going through your mind. For example, suppose you used data from previous sales to build a classifier to predict customer purchasing behavior. You would like an estimate of how accurately the classifier can predict the purchasing behavior of future customers, that is, future customer data on which the classifier has not been trained. You may even have tried different methods to build more than one classifier and now wish to compare their accuracy. But what is accuracy? How can we estimate it? Are some measures of a classifier’s accuracy more appropriate than others? How can we obtain a reliable accuracy estimate? These questions are addressed in this section.
Section 8.5.1 describes various evaluation metrics for the predictive accuracy of a classifier. Holdout and random subsampling (Section 8.5.2), cross-validation (Section 8.5.3), and bootstrap methods (Section 8.5.4) are common techniques for assessing accuracy, based on randomly sampled partitions of the given data. What if we have more than one classifier and want to choose the “best” one? This is referred to as model selection (i.e., choosing one classifier over another). The last two sections address this issue. Section 8.5.5 discusses how to use tests of statistical significance to assess whether the difference in accuracy between two classifiers is due to chance. Section 8.5.6 presents how to compare classifiers based on cost–benefit and receiver operating characteristic (ROC) curves.

8.5.1 Metrics for Evaluating Classifier Performance

This section presents measures for assessing how good or how “accurate” your classifier is at predicting the class label of tuples. We will consider the case where the class tuples are more or less evenly distributed, as well as the case where classes are unbalanced (e.g., where an important class of interest is rare, such as in medical tests). The classifier evaluation measures presented in this section are summarized in Figure 8.13. They include accuracy (also known as recognition rate), sensitivity (or recall), specificity, precision, F_1, and F_β. Note that although accuracy is a specific measure, the word “accuracy” is also used as a general term to refer to a classifier’s predictive abilities.
Using training data to derive a classifier and then estimate the accuracy of the resulting learned model can result in misleading overoptimistic estimates due to overspecialization of the learning algorithm to the data. (We will say more on this in a moment!) Instead, it is better to measure the classifier’s accuracy on a test set consisting of class-labeled tuples that were not used to train the model.
Before we discuss the various measures, we need to become comfortable with some terminology. Recall that we can talk in terms of positive tuples (tuples of the main class of interest) and negative tuples (all other tuples).⁶ Given two classes, for example, the positive tuples may be buys_computer = yes while the negative tuples are

⁶ In the machine learning and pattern recognition literature, these are referred to as positive samples and negative samples, respectively.

the F_β measure. They are defined as

$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{8.28}$$

$$F_\beta = \frac{(1 + \beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}, \tag{8.29}$$

where β is a non-negative real number. The F measure is the harmonic mean of precision and recall (the proof of which is left as an exercise). It gives equal weight to precision and recall. The F_β measure is a weighted measure of precision and recall. It assigns β times as much weight to recall as to precision. Commonly used F_β measures are F_2 (which weights recall twice as much as precision) and F_0.5 (which weights precision twice as much as recall).
“Are there other cases where accuracy may not be appropriate?” In classification problems, it is commonly assumed that all tuples are uniquely classifiable, that is, that each training tuple can belong to only one class. Yet, owing to the wide diversity of data in large databases, it is not always reasonable to assume that all tuples are uniquely classifiable. Rather, it is more probable to assume that each tuple may belong to more than one class. How then can the accuracy of classifiers on large databases be measured? The accuracy measure is not appropriate, because it does not take into account the possibility of tuples belonging to more than one class.
Rather than returning a class label, it is useful to return a probability class distribution. Accuracy measures may then use a second guess heuristic, whereby a class prediction is judged as correct if it agrees with the first or second most probable class. Although this does take into consideration, to some degree, the nonunique classification of tuples, it is not a complete solution.
In addition to accuracy-based measures, classifiers can also be compared with respect to the following additional aspects:

Speed: This refers to the computational costs involved in generating and using the given classifier.

Robustness: This is the ability of the classifier to make correct predictions given noisy data or data with missing values. Robustness is typically assessed with a series of synthetic data sets representing increasing degrees of noise and missing values.

Scalability: This refers to the ability to construct the classifier efficiently given large amounts of data. Scalability is typically assessed with a series of data sets of increasing size.

Interpretability: This refers to the level of understanding and insight that is provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess. Decision trees and classification rules can be easy to interpret, yet their interpretability may diminish the more they become complex. We discuss some work in this area, such as the extraction of classification rules from a “black box” neural network classifier called backpropagation, in Chapter 9.
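The F and F_β measures of Eqs. (8.28) and (8.29) can be sketched in a few lines of Python. The function name and the zero-denominator convention below are assumptions made for this illustration; the precision and recall values are those of the yes class in Example 8.10.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta of Eq. (8.29); beta = 1 gives the F measure of Eq. (8.28)."""
    if beta < 0:
        raise ValueError("beta must be non-negative")
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when the denominator vanishes
    return (1 + beta ** 2) * precision * recall / denom


# Precision and recall of the yes class in Example 8.10.
precision, recall = 90 / 230, 90 / 300

f1 = f_beta(precision, recall)               # equal weight
f2 = f_beta(precision, recall, beta=2.0)     # weights recall more heavily
f05 = f_beta(precision, recall, beta=0.5)    # weights precision more heavily
print(round(f1, 4), round(f2, 4), round(f05, 4))
```

Because recall is lower than precision for this classifier, F_2 comes out below F_1 and F_0.5 above it, which illustrates how β shifts the emphasis between the two measures.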

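The control flow of the sequential covering algorithm in Figure 8.10 can be sketched in Python as follows. This is a simplified illustration under several assumptions: rule antecedents contain only equality tests, `learn_one_rule` is a hypothetical greedy stand-in for Learn_One_Rule that scores candidates by accuracy and then coverage, and the loop for a class stops once no tuples of that class remain. The loan data are invented for the example.

```python
def covers(rule, x):
    """A rule antecedent is a dict of attribute = value tests, ANDed together."""
    return all(x.get(att) == val for att, val in rule.items())


def learn_one_rule(data, att_vals, c):
    """Greedy general-to-specific search (a simplified stand-in for Learn_One_Rule)."""
    rule = {}
    while True:
        covered = [y for x, y in data if covers(rule, x)]
        pos = sum(1 for y in covered if y == c)
        if pos == len(covered):            # no negative tuples covered
            return rule
        best, best_score = None, (pos / len(covered), pos)
        for att, vals in att_vals.items():
            if att in rule:
                continue
            for val in vals:
                cand = {**rule, att: val}
                cov = [y for x, y in data if covers(cand, x)]
                p = sum(1 for y in cov if y == c)
                if p and (p / len(cov), p) > best_score:
                    best, best_score = cand, (p / len(cov), p)
        if best is None:                   # no added test improves the rule
            return rule
        rule = best


def sequential_covering(data, att_vals, classes):
    rule_set = []                                      # step (1)
    data = list(data)
    for c in classes:                                  # step (2)
        while any(y == c for _, y in data):            # assumed terminating condition
            rule = learn_one_rule(data, att_vals, c)   # step (4)
            data = [(x, y) for x, y in data if not covers(rule, x)]  # step (5)
            rule_set.append((rule, c))                 # step (6)
    return rule_set                                    # step (9)


# Hypothetical loan data: (attribute values, class label).
data = [
    ({"income": "high", "credit_rating": "excellent"}, "accept"),
    ({"income": "high", "credit_rating": "fair"}, "accept"),
    ({"income": "medium", "credit_rating": "excellent"}, "accept"),
    ({"income": "medium", "credit_rating": "fair"}, "reject"),
    ({"income": "low", "credit_rating": "excellent"}, "reject"),
    ({"income": "low", "credit_rating": "fair"}, "reject"),
]
att_vals = {"income": ["high", "medium", "low"],
            "credit_rating": ["excellent", "fair"]}

for antecedent, label in sequential_covering(data, att_vals, ["accept", "reject"]):
    print("IF", antecedent, "THEN loan_decision =", label)
```

On this toy data the last rule has an empty antecedent, so it behaves like a default rule for the tuples left uncovered. A fuller implementation would keep several candidates at each step (beam search) and would use a quality measure such as FOIL_Gain rather than plain accuracy.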

from other classes. In this way, the rules learned should be of high accuracy. The rules need not necessarily be of high coverage. This is because we can have more than one rule for a class, so that different rules may cover different tuples within the same class. The process continues until the terminating condition is met, such as when there are no more training tuples or the quality of a rule returned is below a user-specified threshold. The Learn_One_Rule procedure finds the “best” rule for the current class, given the current set of training tuples.
“How are rules learned?” Typically, rules are grown in a general-to-specific manner (Figure 8.11). We can think of this as a beam search, where we start off with an empty rule and then gradually keep appending attribute tests to it. We append by adding the attribute test as a logical conjunct to the existing condition of the rule antecedent. Suppose our training set, D, consists of loan application data. Attributes regarding each applicant include their age, income, education level, residence, credit rating, and the term of the loan. The classifying attribute is loan_decision, which indicates whether a loan is accepted (considered safe) or rejected (considered risky). To learn a rule for the class “accept,” we start off with the most general rule possible, that is, the condition of the rule antecedent is empty. The rule is

IF THEN loan_decision = accept.

We then consider each possible attribute test that may be added to the rule. These can be derived from the parameter Att_vals, which contains a list of attributes with their associated values. For example, for an attribute–value pair (att, val), we can consider attribute tests such as att = val, att ≤ val, att > val, and so on. Typically, the training data will contain many attributes, each of which may have several possible values. Finding an optimal rule set becomes computationally explosive. Instead, Learn_One_Rule

IF THEN loan_decision = accept
    IF loan_term = short THEN loan_decision = accept
    IF loan_term = long THEN loan_decision = accept
    IF income = high THEN loan_decision = accept
        IF income = high AND age = youth THEN loan_decision = accept
        IF income = high AND age = middle_age THEN loan_decision = accept
        IF income = high AND credit_rating = excellent THEN loan_decision = accept
        IF income = high AND credit_rating = fair THEN loan_decision = accept
        ···
    IF income = medium THEN loan_decision = accept
    ···

Figure 8.11 A general-to-specific search through rule space.

Measure                                                   Formula
accuracy, recognition rate                                (TP + TN) / (P + N)
error rate, misclassification rate                        (FP + FN) / (P + N)
sensitivity, true positive rate, recall                   TP / P
specificity, true negative rate                           TN / N
precision                                                 TP / (TP + FP)
F, F_1, F-score, harmonic mean of precision and recall    (2 × precision × recall) / (precision + recall)
F_β, where β is a non-negative real number                ((1 + β²) × precision × recall) / (β² × precision + recall)

Figure 8.13 Evaluation measures. Note that some measures are known by more than one name. TP, TN, FP, P, N refer to the number of true positive, true negative, false positive, positive, and negative samples, respectively (see text).

buys_computer = no. Suppose we use our classifier on a test set of labeled tuples. P is the number of positive tuples and N is the number of negative tuples. For each tuple, we compare the classifier’s class label prediction with the tuple’s known class label.
There are four additional terms we need to know that are the “building blocks” used in computing many evaluation measures. Understanding them will make it easy to grasp the meaning of the various measures.

True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.

True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.

False positives (FP): These are the negative tuples that were incorrectly labeled as positive (e.g., tuples of class buys_computer = no for which the classifier predicted buys_computer = yes). Let FP be the number of false positives.

False negatives (FN): These are the positive tuples that were mislabeled as negative (e.g., tuples of class buys_computer = yes for which the classifier predicted buys_computer = no). Let FN be the number of false negatives.

These terms are summarized in the confusion matrix of Figure 8.14.
The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples of different classes. TP and TN tell us when the classifier is getting things right, while FP and FN tell us when the classifier is getting things wrong (i.e.,

[Diagram: the data are split into a training set, which is used to derive the model, and a test set, which is used to estimate the model's accuracy.]

Figure 8.17 Estimating accuracy with the holdout method.

In summary, we have presented several evaluation measures. The accuracy measure works best when the data classes are fairly evenly distributed. Other measures, such as sensitivity (or recall), specificity, precision, F, and F_β, are better suited to the class imbalance problem, where the main class of interest is rare. The remaining subsections focus on obtaining reliable classifier accuracy estimates.

8.5.2 Holdout Method and Random Subsampling

The holdout method is what we have alluded to so far in our discussions about accuracy. In this method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model. The model’s accuracy is then estimated with the test set (Figure 8.17). The estimate is pessimistic because only a portion of the initial data is used to derive the model.
Random subsampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration.

8.5.3 Cross-Validation

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D_1, D_2, ..., D_k, each of approximately equal size. Training and testing are performed k times. In iteration i, partition D_i is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D_2, ..., D_k collectively serve as the training set to obtain a first model, which is tested on D_1; the second iteration is trained on subsets D_1, D_3, ..., D_k and tested on D_2; and so on. Unlike the holdout and random subsampling methods, here each sample is used the same number of times for training and once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.
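The k-fold cross-validation procedure of Section 8.5.3 can be sketched in plain Python. The helper names (`k_fold_indices`, `cross_validate`) and the majority-class learner used to exercise them are hypothetical; any pair of training and prediction functions with the same shape could be substituted.

```python
import random
from collections import Counter


def k_fold_indices(n: int, k: int, seed: int = 0):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]     # fold sizes differ by at most 1


def cross_validate(data, labels, k, train_fn, predict_fn, seed=0):
    """Total correct classifications over the k iterations / number of tuples."""
    folds = k_fold_indices(len(data), k, seed)
    correct = 0
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = train_fn([data[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        correct += sum(predict_fn(model, data[j]) == labels[j] for j in test_idx)
    return correct / len(data)


# Hypothetical learner: always predict the majority class of the training set.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


data = list(range(10))
labels = ["yes"] * 8 + ["no"] * 2
acc = cross_validate(data, labels, k=5,
                     train_fn=train_majority, predict_fn=predict_majority)
print(acc)
```

Each tuple lands in exactly one fold, so it is tested once and used for training k − 1 times. Stratification, as used in stratified cross-validation, is not shown here.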

Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples. That is, only one sample is “left out” at a time for the test set. In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data.
In general, stratified 10-fold cross-validation is recommended for estimating accuracy (even if computation power allows using more folds) due to its relatively low bias and variance.

8.5.4 Bootstrap

Unlike the accuracy estimation methods just mentioned, the bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. For instance, imagine a machine that randomly selects tuples for our training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.
There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set (hence, the name, .632 bootstrap).
“Where does the figure, 63.2%, come from?” Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d). We have to select d times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)^d. If d is large, the probability approaches e^−1 ≈ 0.368.⁷ Thus, 36.8% of tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.
We can repeat the sampling procedure k times, where in each iteration, we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model, M, is then estimated as

$$Acc(M) = \frac{1}{k} \sum_{i=1}^{k} \left(0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set}\right), \tag{8.30}$$

where Acc(M_i)_test_set is the accuracy of the model obtained with bootstrap sample i when it is applied to test set i. Acc(M_i)_train_set is the accuracy of the model obtained with bootstrap sample i when it is applied to the original set of data tuples. Bootstrapping tends to be overly optimistic. It works best with small data sets.

⁷ e is the base of natural logarithms, that is, e ≈ 2.718.

[Plot: ROC curve with its convex hull and the diagonal line for random guessing. Horizontal axis: false positive rate (FPR); vertical axis: true positive rate (TPR); both run from 0.0 to 1.0.]

Figure 8.19 ROC curve for the data in Figure 8.18.

remaining nine tuples, which are all classified as negative, five actually are negative (thus, TN = 5). The remaining four are all actually positive, thus, FN = 4. We can therefore compute TPR = TP/P = 1/5 = 0.2, while FPR = 0. Thus, we have the point (0.2, 0) for the ROC curve.
Next, threshold t is set to 0.8, the probability value for tuple 2, so this tuple is now also considered positive, while tuples 3 through 10 are considered negative. The actual class label of tuple 2 is positive, thus now TP = 2. The rest of the row can easily be computed, resulting in the point (0.4, 0). Next, we examine the class label of tuple 3 and let t be 0.7, the probability value returned by the classifier for that tuple. Thus, tuple 3 is considered positive, yet its actual label is negative, and so it is a false positive. Thus, TP stays the same and FP increments so that FP = 1. The rest of the values in the row can also be easily computed, yielding the point (0.4, 0.2). The resulting ROC graph, from examining each tuple, is the jagged line shown in Figure 8.19.
There are many methods to obtain a curve out of these points, the most common of which is to use a convex hull. The plot also shows a diagonal line where for every true positive of such a model, we are just as likely to encounter a false positive. For comparison, this line represents random guessing.

Figure 8.20 shows the ROC curves of two classification models. The diagonal line representing random guessing is also shown. Thus, the closer the ROC curve of a model is to the diagonal line, the less accurate the model. If the model is really good, initially we are more likely to encounter true positives as we move down the ranked list. Thus,

with replacement is used; the same tuple may be selected more than once. Each tuple’s chance of being selected is based on its weight. A classifier model, M_i, is derived from the training tuples of D_i. Its error is then calculated using D_i as a test set. The weights of the training tuples are then adjusted according to how they were classified.
If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly classified, its weight is decreased. A tuple’s weight reflects how difficult it is to classify: the higher the weight, the more often it has been misclassified. These weights will be used to generate the training samples for the classifier of the next round. The basic idea is that when we build a classifier, we want it to focus more on the misclassified tuples of the previous round. Some classifiers may be better at classifying some “difficult” tuples than others. In this way, we build a series of classifiers that complement each other. The algorithm is summarized in Figure 8.24.
Now, let’s look at some of the math that’s involved in the algorithm. To compute the error rate of model M_i, we sum the weights of each of the tuples in D_i that M_i misclassified. That is,

$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j), \tag{8.34}$$

where err(X_j) is the misclassification error of tuple X_j: If the tuple was misclassified, then err(X_j) is 1; otherwise, it is 0. If the performance of classifier M_i is so poor that its error exceeds 0.5, then we abandon it. Instead, we try again by generating a new D_i training set, from which we derive a new M_i.
The error rate of M_i affects how the weights of the training tuples are updated. If a tuple in round i was correctly classified, its weight is multiplied by error(M_i)/(1 − error(M_i)). Once the weights of all the correctly classified tuples are updated, the weights for all tuples (including the misclassified ones) are normalized so that their sum remains the same as it was before. To normalize a weight, we multiply it by the sum of the old weights, divided by the sum of the new weights. As a result, the weights of misclassified tuples are increased and the weights of correctly classified tuples are decreased, as described before.
“Once boosting is complete, how is the ensemble of classifiers used to predict the class label of a tuple, X?” Unlike bagging, where each classifier was assigned an equal vote, boosting assigns a weight to each classifier’s vote, based on how well the classifier performed. The lower a classifier’s error rate, the more accurate it is, and therefore, the higher its weight for voting should be. The weight of classifier M_i’s vote is

$$\log \frac{1 - error(M_i)}{error(M_i)}. \tag{8.35}$$

For each class, c, we sum the weights of each classifier that assigned class c to X. The class with the highest sum is the “winner” and is returned as the class prediction for tuple X.
“How does boosting compare with bagging?” Because of the way boosting focuses on the misclassified tuples, it risks overfitting the resulting composite model to such data.

372 Chapter 8 Classification: Basic Concepts 8.6 Techniques to Improve Classification Accuracy 377 382 Chapter 8 Classification: Basic Concepts

the curve moves steeply up from zero. Later, as we start to encounter fewer and fewer true positives, and more and more false positives, the curve eases off and becomes more horizontal.

[Plot: true positive rate (vertical axis) versus false positive rate (horizontal axis), both from 0.0 to 1.0, with curves for M1 and M2 and the diagonal.]
Figure 8.20 ROC curves of two classification models, M1 and M2. The diagonal shows where, for every true positive, we are equally likely to encounter a false positive. The closer an ROC curve is to the diagonal line, the less accurate the model is. Thus, M1 is more accurate here.

To assess the accuracy of a model, we can measure the area under the curve. Several software packages are able to perform such calculation. The closer the area is to 0.5, the less accurate the corresponding model is. A model with perfect accuracy will have an area of 1.0.

8.6 Techniques to Improve Classification Accuracy

In this section, you will learn some tricks for increasing classification accuracy. We focus on ensemble methods. An ensemble for classification is a composite model, made up of a combination of classifiers. The individual classifiers vote, and a class label prediction is returned by the ensemble based on the collection of votes. Ensembles tend to be more accurate than their component classifiers. We start off in Section 8.6.1 by introducing ensemble methods in general. Bagging (Section 8.6.2), boosting (Section 8.6.3), and random forests (Section 8.6.4) are popular ensemble methods.

8.5.5 Model Selection Using Statistical Tests of Significance

Suppose that we have generated two classification models, M1 and M2, from our data. We have performed 10-fold cross-validation to obtain a mean error rate⁸ for each. How can we determine which model is best? It may seem intuitive to select the model with the lowest error rate; however, the mean error rates are just estimates of error on the true population of future data cases. There can be considerable variance between error rates within any given 10-fold cross-validation experiment. Although the mean error rates obtained for M1 and M2 may appear different, that difference may not be statistically significant. What if any difference between the two may just be attributed to chance? This section addresses these questions.
To determine if there is any “real” difference in the mean error rates of two models, we need to employ a test of statistical significance. In addition, we want to obtain some confidence limits for our mean error rates so that we can make statements like, “Any observed mean will not vary by ± two standard errors 95% of the time for future samples” or “One model is better than the other by a margin of error of ± 4%.”
What do we need to perform the statistical test? Suppose that for each model, we did 10-fold cross-validation, say, 10 times, each time using a different 10-fold data partitioning. Each partitioning is independently drawn. We can average the 10 error rates obtained each for M1 and M2, respectively, to obtain the mean error rate for each model. For a given model, the individual error rates calculated in the cross-validations may be considered as different, independent samples from a probability distribution. In general, they follow a t-distribution with k − 1 degrees of freedom where, here, k = 10. (This distribution looks very similar to a normal, or Gaussian, distribution even though the functions defining the two are quite different. Both are unimodal, symmetric, and bell-shaped.) This allows us to do hypothesis testing where the significance test used is the t-test, or Student’s t-test. Our hypothesis is that the two models are the same, or in other words, that the difference in mean error rate between the two is zero. If we can reject this hypothesis (referred to as the null hypothesis), then we can conclude that the difference between the two models is statistically significant, in which case we can select the model with the lower error rate.
In data mining practice, we may often employ a single test set, that is, the same test set can be used for both M1 and M2. In such cases, we do a pairwise comparison of the two models for each 10-fold cross-validation round. That is, for the ith round of 10-fold cross-validation, the same cross-validation partitioning is used to obtain an error rate for M1 and for M2. Let $err(M_1)_i$ (or $err(M_2)_i$) be the error rate of model M1 (or M2) on round i. The error rates for M1 are averaged to obtain a mean error rate for M1, denoted $\overline{err}(M_1)$. Similarly, we can obtain $\overline{err}(M_2)$. The variance of the difference between the two models is denoted $var(M_1 - M_2)$. The t-test computes the t-statistic with k − 1 degrees of freedom for k samples. In our example we have k = 10 since, here, the k samples are our error rates obtained from ten 10-fold cross-validations for each

Algorithm: AdaBoost. A boosting algorithm: create an ensemble of classifiers. Each one gives a weighted vote.
Input:
    D, a set of d class-labeled training tuples;
    k, the number of rounds (one classifier is generated per round);
    a classification learning scheme.
Output: A composite model.
Method:
(1) initialize the weight of each tuple in D to 1/d;
(2) for i = 1 to k do // for each round:
(3)     sample D with replacement according to the tuple weights to obtain Di;
(4)     use training set Di to derive a model, Mi;
(5)     compute error(Mi), the error rate of Mi (Eq. 8.34)
(6)     if error(Mi) > 0.5 then
(7)         go back to step 3 and try again;
(8)     endif
(9)     for each tuple in Di that was correctly classified do
(10)        multiply the weight of the tuple by error(Mi)/(1 − error(Mi)); // update weights
(11)    normalize the weight of each tuple;
(12) endfor

To use the ensemble to classify tuple, X:
(1) initialize weight of each class to 0;
(2) for i = 1 to k do // for each classifier:
(3)     wi = log((1 − error(Mi))/error(Mi)); // weight of the classifier’s vote
(4)     c = Mi(X); // get class prediction for X from Mi
(5)     add wi to weight for class c
(6) endfor
(7) return the class with the largest weight;

Figure 8.24 AdaBoost, a boosting algorithm.

Therefore, sometimes the resulting “boosted” model may be less accurate than a single model derived from the same data. Bagging is less susceptible to model overfitting. While both can significantly improve accuracy in comparison to a single model, boosting tends to achieve greater accuracy.
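The AdaBoost procedure of Figure 8.24 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the base learner `train_stump` (a one-attribute threshold rule), the function names, the retry cap `max_tries`, and the small clamp on a zero error rate are all choices made here. The error rate is taken to be the weighted sum of misclassified tuples of D, which is our reading of Eq. 8.34, and the weight update is applied to the correctly classified tuples of D.

```python
import math
import random

def train_stump(sample):
    """Hypothetical base learner: best single-threshold rule on (x, label) pairs."""
    best = None
    for thr in sorted({x for x, _ in sample}):
        for left, right in ((0, 1), (1, 0)):
            errs = sum((left if x <= thr else right) != y for x, y in sample)
            if best is None or errs < best[0]:
                best = (errs, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def adaboost(data, k, rng, max_tries=50):
    """Return a list of (model, error) pairs following Figure 8.24."""
    d = len(data)
    weights = [1.0 / d] * d                                   # step (1)
    ensemble = []
    for _ in range(k):                                        # step (2)
        for _ in range(max_tries):
            sample = rng.choices(data, weights=weights, k=d)  # step (3)
            model = train_stump(sample)                       # step (4)
            wrong = [model(x) != y for x, y in data]
            # step (5): weighted error over D (assumed form of Eq. 8.34)
            error = sum(w for w, bad in zip(weights, wrong) if bad)
            if error <= 0.5:                                  # steps (6)-(8)
                break
        else:
            return ensemble        # assumption: give up after max_tries retries
        error = max(error, 1e-10)  # assumption: avoid division by zero
        for j in range(d):                                    # steps (9)-(10)
            if not wrong[j]:
                weights[j] *= error / (1.0 - error)
        total = sum(weights)                                  # step (11)
        weights = [w / total for w in weights]
        ensemble.append((model, error))
    return ensemble

def classify(ensemble, x):
    """Weighted vote: each classifier votes with weight log((1 - error)/error)."""
    votes = {}
    for model, error in ensemble:
        w = math.log((1.0 - error) / error)
        c = model(x)
        votes[c] = votes.get(c, 0.0) + w
    return max(votes, key=votes.get)
```

A call such as `adaboost(data, 5, random.Random(0))` on a list of `(x, label)` pairs with labels 0 and 1 returns the ensemble, and `classify(ensemble, x)` applies the weighted vote to a new value.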
⁸ Recall that the error rate of a model, M, is 1 − accuracy(M).

Traditional learning models assume that the data classes are well distributed. In many real-world data domains, however, the data are class-imbalanced, where the main class of interest is represented by only a few tuples. This is known as the class

8.6.4 Random Forests
We now present another ensemble method called random forests. Imagine that each of the classifiers in the ensemble is a decision tree classifier so that the collection of classifiers


is a “forest.” The individual decision trees are generated using a random selection of attributes at each node to determine the split. More formally, each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. During classification, each tree votes and the most popular class is returned.
Random forests can be built using bagging (Section 8.6.2) in tandem with random attribute selection. A training set, D, of d tuples is given. The general procedure to generate k decision trees for the ensemble is as follows. For each iteration, i (i = 1, 2, ..., k), a training set, Di, of d tuples is sampled with replacement from D. That is, each Di is a bootstrap sample of D (Section 8.5.4), so that some tuples may occur more than once in Di, while others may be excluded. Let F be the number of attributes to be used to determine the split at each node, where F is much smaller than the number of available attributes. To construct a decision tree classifier, Mi, randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees. The trees are grown to maximum size and are not pruned. Random forests formed this way, with random input selection, are called Forest-RI.
Another form of random forest, called Forest-RC, uses random linear combinations of the input attributes. Instead of randomly selecting a subset of the attributes, it creates new attributes (or features) that are a linear combination of the existing attributes. That is, an attribute is generated by specifying L, the number of original attributes to be combined. At a given node, L attributes are randomly selected and added together with coefficients that are uniform random numbers on [−1, 1]. F linear combinations are generated, and a search is made over these for the best split. This form of random forest is useful when there are only a few attributes available, so as to reduce the correlation between individual classifiers.
Random forests are comparable in accuracy to AdaBoost, yet are more robust to errors and outliers. The generalization error for a forest converges as long as the number of trees in the forest is large. Thus, overfitting is not a problem. The accuracy of a random forest depends on the strength of the individual classifiers and a measure of the dependence between them. The ideal is to maintain the strength of individual classifiers without increasing their correlation. Random forests are insensitive to the number of attributes selected for consideration at each split. Typically, up to $\log_2 d + 1$ are chosen. (An interesting empirical observation was that using a single random input attribute may result in good accuracy that is often higher than when using several attributes.)

imbalance problem. We also study techniques for improving the classification accuracy of class-imbalanced data. These are presented in Section 8.6.5.

8.6.1 Introducing Ensemble Methods
Bagging, boosting, and random forests are examples of ensemble methods (Figure 8.21). An ensemble combines a series of k learned models (or base classifiers), M1, M2, ..., Mk, with the aim of creating an improved composite classification model, M∗. A given data set, D, is used to create k training sets, D1, D2, ..., Dk, where Di (1 ≤ i ≤ k) is used to generate classifier Mi. Given a new data tuple to classify, the base classifiers each vote by returning a class prediction. The ensemble returns a class prediction based on the votes of the base classifiers.
An ensemble tends to be more accurate than its base classifiers. For example, consider an ensemble that performs majority voting. That is, given a tuple X to classify, it collects the class label predictions returned from the base classifiers and outputs the class in majority. The base classifiers may make mistakes, but the ensemble will misclassify X only if over half of the base classifiers are in error. Ensembles yield better results when there is significant diversity among the models. That is, ideally, there is little correlation among classifiers. The classifiers should also perform better than random guessing. Each base classifier can be allocated to a different CPU and so ensemble methods are parallelizable.
To help illustrate the power of an ensemble, consider a simple two-class problem described by two attributes, x1 and x2. The problem has a linear decision boundary. Figure 8.22(a) shows the decision boundary of a decision tree classifier on the problem. Figure 8.22(b) shows the decision boundary of an ensemble of decision tree classifiers on the same problem. Although the ensemble’s decision boundary is still piecewise constant, it has a finer resolution and is better than that of a single tree.

model. The t-statistic for pairwise comparison is computed as follows:

$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}, \qquad (8.31)$$

where

$$var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\left[err(M_1)_i - err(M_2)_i - \left(\overline{err}(M_1) - \overline{err}(M_2)\right)\right]^2. \qquad (8.32)$$

To determine whether M1 and M2 are significantly different, we compute t and select a significance level, sig. In practice, a significance level of 5% or 1% is typically used. We then consult a table for the t-distribution, available in standard textbooks on statistics. This table is usually shown arranged by degrees of freedom as rows and significance levels as columns. Suppose we want to ascertain whether the difference between M1 and M2 is significantly different for 95% of the population, that is, sig = 5% or 0.05. We need to find the t-distribution value corresponding to k − 1 degrees of freedom (or 9 degrees of freedom for our example) from the table. However, because the t-distribution is symmetric, typically only the upper percentage points of the distribution are shown. Therefore, we look up the table value for z = sig/2, which in this case is 0.025, where z is also referred to as a confidence limit. If t > z or t < −z, then our value of t lies in the rejection region, within the distribution’s tails. This means that we can reject the null hypothesis that the means of M1 and M2 are the same and conclude that there is a statistically significant difference between the two models. Otherwise, if we cannot reject the null hypothesis, we conclude that any difference between M1 and M2 can be attributed to chance.
If two test sets are available instead of a single test set, then a nonpaired version of the t-test is used, where the variance between the means of the two models is estimated as

$$var(M_1 - M_2) = \sqrt{\frac{var(M_1)}{k_1} + \frac{var(M_2)}{k_2}}, \qquad (8.33)$$

and k1 and k2 are the number of cross-validation samples (in our case, 10-fold cross-validation rounds) used for M1 and M2, respectively. This is also known as the two-sample t-test.⁹ When consulting the table of t-distribution, the number of degrees of freedom used is taken as the minimum number of degrees of the two models.
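The pairwise t-statistic of Eqs. (8.31) and (8.32) is short enough to compute directly. The sketch below is a minimal illustration: the function name and the two-round error rates in the usage note are made up for the example, and looking up the critical value still requires a t-distribution table, which is not reproduced here.

```python
import math

def paired_t_statistic(err1, err2):
    """t-statistic for pairwise comparison of two models (Eqs. 8.31 and 8.32).

    err1[i] and err2[i] are the error rates of M1 and M2 on the same round i.
    """
    k = len(err1)
    if k != len(err2) or k == 0:
        raise ValueError("need the same, nonzero number of rounds for both models")
    mean1 = sum(err1) / k
    mean2 = sum(err2) / k
    # Eq. (8.32): variance of the per-round difference around the mean difference
    var = sum((a - b - (mean1 - mean2)) ** 2 for a, b in zip(err1, err2)) / k
    if var == 0:
        raise ValueError("zero variance: the t-statistic is undefined")
    # Eq. (8.31)
    return (mean1 - mean2) / math.sqrt(var / k)
```

For instance, with hypothetical error rates `[0.3, 0.1]` for M1 and `[0.1, 0.1]` for M2, the differences are 0.2 and 0.0, the variance is 0.01, and the statistic is 0.1 / sqrt(0.01 / 2), that is, the square root of 2. The result is then compared against the table value for k − 1 degrees of freedom.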
Because random forests consider many fewer attributes for each split, they are efficient on very large databases. They can be faster than either bagging or boosting. Random forests give internal estimates of variable importance.

8.6.5 Improving Classification Accuracy of Class-Imbalanced Data
In this section, we revisit the class imbalance problem. In particular, we study approaches to improving the classification accuracy of class-imbalanced data.
Given two-class data, the data are class-imbalanced if the main class of interest (the positive class) is represented by only a few tuples, while the majority of tuples represent the negative class. For multiclass-imbalanced data, the data distribution of each class

[Diagram: the data, D, yields training sets D1, D2, ..., Dk, which produce models M1, M2, ..., Mk; their votes on a new data tuple are combined into a prediction.]
Figure 8.21 Increasing classifier accuracy: Ensemble methods generate a set of classification models, M1, M2, ..., Mk. Given a new data tuple to classify, each classifier “votes” for the class label of that tuple. The ensemble combines the votes to return a class prediction.

⁹ This test was used in sampling cubes for OLAP-based mining in Chapter 5.

8.5.6 Comparing Classifiers Based on Cost–Benefit and ROC Curves

The true positives, true negatives, false positives, and false negatives are also useful in assessing the costs and benefits (or risks and gains) associated with a classification


model. The cost associated with a false negative (such as incorrectly predicting that a cancerous patient is not cancerous) is far greater than those of a false positive (incorrectly yet conservatively labeling a noncancerous patient as cancerous). In such cases, we can outweigh one type of error over another by assigning a different cost to each. These costs may consider the danger to the patient, financial costs of resulting therapies, and other hospital costs. Similarly, the benefits associated with a true positive decision may be different than those of a true negative. Up to now, to compute classifier accuracy, we have assumed equal costs and essentially divided the sum of true positives and true negatives by the total number of test tuples.
Alternatively, we can incorporate costs and benefits by instead computing the average cost (or benefit) per decision. Other applications involving cost–benefit analysis include loan application decisions and target marketing mailouts. For example, the cost of loaning to a defaulter greatly exceeds that of the lost business incurred by denying a loan to a nondefaulter. Similarly, in an application that tries to identify households that are likely to respond to mailouts of certain promotional material, the cost of mailouts to numerous households that do not respond may outweigh the cost of lost business from not mailing to households that would have responded. Other costs to consider in the overall analysis include the costs to collect the data and to develop the classification tool.
Receiver operating characteristic curves are a useful visual tool for comparing two classification models. ROC curves come from signal detection theory that was developed during World War II for the analysis of radar images. An ROC curve for a given model shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR).¹⁰ Given a test set and a model, TPR is the proportion of positive (or “yes”) tuples that are correctly labeled by the model; FPR is the proportion of negative (or “no”) tuples that are mislabeled as positive. Given that TP, FP, P, and N are the number of true positive, false positive, positive, and negative tuples, respectively, from Section 8.5.1 we know that $TPR = \frac{TP}{P}$, which is sensitivity. Furthermore, $FPR = \frac{FP}{N}$, which is 1 − specificity.
For a two-class problem, an ROC curve allows us to visualize the trade-off between the rate at which the model can accurately recognize positive cases versus the rate at which it mistakenly identifies negative cases as positive for different portions of the test set. Any increase in TPR occurs at the cost of an increase in FPR. The area under the ROC curve is a measure of the accuracy of the model.
To plot an ROC curve for a given classification model, M, the model must be able to return a probability of the predicted class for each test tuple. With this information, we rank and sort the tuples so that the tuple that is most likely to belong to the positive or “yes” class appears at the top of the list, and the tuple that is least likely to belong to the positive class lands at the bottom of the list. Naïve Bayesian (Section 8.3) and backpropagation (Section 9.2) classifiers return a class probability distribution for each prediction and, therefore, are appropriate, although other classifiers, such as decision tree classifiers (Section 8.2), can easily be modified to return class probability predictions. Let the value

differs substantially where, again, the main class or classes of interest are rare. The class imbalance problem is closely related to cost-sensitive learning, wherein the costs of errors, per class, are not equal. In medical diagnosis, for example, it is much more costly to falsely diagnose a cancerous patient as healthy (a false negative) than to misdiagnose a healthy patient as having cancer (a false positive). A false negative error could lead to the loss of life and therefore is much more expensive than a false positive error. Other applications involving class-imbalanced data include fraud detection, the detection of oil spills from satellite radar images, and fault monitoring.
Traditional classification algorithms aim to minimize the number of errors made during classification. They assume that the costs of false positive and false negative errors are equal. By assuming a balanced distribution of classes and equal error costs, they are therefore not suitable for class-imbalanced data. Earlier parts of this chapter presented ways of addressing the class imbalance problem. Although the accuracy measure assumes that the cost of classes are equal, alternative evaluation metrics can be used that consider the different types of classifications. Section 8.5.1, for example, presented sensitivity or recall (the true positive rate) and specificity (the true negative rate), which help to assess how well a classifier can predict the class label of imbalanced data. Additional relevant measures discussed include F1 and Fβ. Section 8.5.6 showed how ROC curves plot sensitivity versus 1 − specificity (i.e., the false positive rate). Such curves can provide insight when studying the performance of classifiers on class-imbalanced data.
In this section, we look at general approaches for improving the classification accuracy of class-imbalanced data. These approaches include (1) oversampling, (2) undersampling, (3) threshold moving, and (4) ensemble techniques. The first three do not involve any changes to the construction of the classification model. That is, oversampling and undersampling change the distribution of tuples in the training set; threshold moving affects how the model makes decisions when classifying new data. Ensemble methods follow the techniques described in Sections 8.6.2 through 8.6.4. For ease of explanation, we describe these general approaches with respect to the two-class imbalance data problem, where the higher-cost classes are rarer than the lower-cost classes.
Both oversampling and undersampling change the training data distribution so that the rare (positive) class is well represented. Oversampling works by resampling the positive tuples so that the resulting training set contains an equal number of positive and negative tuples. Undersampling works by decreasing the number of negative tuples. It randomly eliminates tuples from the majority (negative) class until there are an equal number of positive and negative tuples.

Example 8.12 Oversampling and undersampling. Suppose the original training set contains 100 positive and 1000 negative tuples. In oversampling, we replicate tuples of the rarer class to form a new training set containing 1000 positive tuples and 1000 negative tuples. In undersampling, we randomly eliminate negative tuples so that the new training set contains 100 positive tuples and 100 negative tuples.

[Two panels, (a) and (b), each plotting x2 against x1 on axes from 0.0 to 1.0.]
Figure 8.22 Decision boundary by (a) a single decision tree and (b) an ensemble of decision trees for a linearly separable problem (i.e., where the actual decision boundary is a straight line). The decision tree struggles with approximating a linear boundary. The decision boundary of the ensemble is closer to the true boundary. Source: From Seni and Elder [SE10]. © 2010 Morgan & Claypool Publishers; used with permission.

8.6.2 Bagging
We now take an intuitive look at how bagging works as a method of increasing accuracy. Suppose that you are a patient and would like to have a diagnosis made based on your symptoms. Instead of asking one doctor, you may choose to ask several. If a certain diagnosis occurs more than any other, you may choose this as the final or best diagnosis. That is, the final diagnosis is made based on a majority vote, where each doctor gets an equal vote. Now replace each doctor by a classifier, and you have the basic idea behind bagging. Intuitively, a majority vote made by a large group of doctors may be more reliable than a majority vote made by a small group.
Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set, Di, of d tuples is sampled with replacement from the original set of tuples, D. Note that the term bagging stands for bootstrap aggregation. Each training set is a bootstrap sample, as described in Section 8.5.4. Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The algorithm is summarized in Figure 8.23.
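The bagging procedure just described (bootstrap samples, one model per sample, equal-weight majority vote) can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function names are our own, and the base learner `learn_1nn`, a one-nearest-neighbor rule on a single numeric attribute, is a stand-in assumed here for whatever classification learning scheme is actually used.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) tuples from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagging(data, k, learn, rng):
    """Learn k models, each from its own bootstrap sample of data."""
    return [learn(bootstrap_sample(data, rng)) for _ in range(k)]

def bagged_predict(models, x):
    """Each model casts one vote; the class with the most votes wins."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def learn_1nn(sample):
    """Hypothetical base learner: label of the nearest training value."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict
```

With `data` given as `(value, label)` pairs, `bagging(data, 11, learn_1nn, random.Random(1))` builds an ensemble of 11 models, and `bagged_predict(models, x)` returns the majority class for a new value. For numeric prediction, the vote would be replaced by the average of the individual predictions.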
¹⁰ TPR and FPR are the two operating characteristics being compared.

The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more

Several variations to oversampling and undersampling exist. They may vary, for instance, in how tuples are added or eliminated. For example, the SMOTE algorithm


that a probabilistic classifier returns for a given tuple X be f(X) → [0, 1]. For a binary problem, a threshold t is typically selected so that tuples where f(X) ≥ t are considered positive and all the other tuples are considered negative. Note that the number of true positives and the number of false positives are both functions of t, so that we could write TP(t) and FP(t). Both are monotonic descending functions.
We first describe the general idea behind plotting an ROC curve, and then follow up with an example. The vertical axis of an ROC curve represents TPR. The horizontal axis represents FPR. To plot an ROC curve for M, we begin as follows. Starting at the bottom left corner (where TPR = FPR = 0), we check the tuple’s actual class label at the top of the list. If we have a true positive (i.e., a positive tuple that was correctly classified), then TP and thus TPR increase. On the graph, we move up and plot a point. If, instead, the model classifies a negative tuple as positive, we have a false positive, and so both FP and FPR increase. On the graph, we move right and plot a point. This process is repeated for each of the test tuples in ranked order, each time moving up on the graph for a true positive or toward the right for a false positive.

Example 8.11 Plotting an ROC curve. Figure 8.18 shows the probability value (column 3) returned by a probabilistic classifier for each of the 10 tuples in a test set, sorted by decreasing probability order. Column 1 is merely a tuple identification number, which aids in our explanation. Column 2 is the actual class label of the tuple. There are five positive tuples and five negative tuples, thus P = 5 and N = 5. As we examine the known class label of each tuple, we can determine the values of the remaining columns, TP, FP, TN, FN, TPR, and FPR. We start with tuple 1, which has the highest probability score, and take that score as our threshold, that is, t = 0.9. Thus, the classifier considers tuple 1 to be positive, and all the other tuples are considered negative. Since the actual class label of tuple 1 is positive, we have a true positive, hence TP = 1 and FP = 0. Among the

Algorithm: Bagging. The bagging algorithm: create an ensemble of classification models for a learning scheme where each model gives an equally weighted prediction.
Input:
    D, a set of d training tuples;
    k, the number of models in the ensemble;
    a classification learning scheme (decision tree algorithm, naïve Bayesian, etc.).
Output: The ensemble, a composite model, M∗.
Method:
(1) for i = 1 to k do // create k models:
(2)     create bootstrap sample, Di, by sampling D with replacement;
(3)     use Di and the learning scheme to derive a model, Mi;
(4) endfor

To use the ensemble to classify a tuple, X:
    let each of the k models classify X and return the majority vote;

Figure 8.23 Bagging.

robust to the effects of noisy data and overfitting. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers.

uses oversampling where synthetic tuples are added, which are “close to” the given positive tuples in tuple space.
The threshold-moving approach to the class imbalance problem does not involve any sampling. It applies to classifiers that, given an input tuple, return a continuous output value (just like in Section 8.5.6, where we discussed how to construct ROC curves). That is, for an input tuple, X, such a classifier returns as output a mapping, f(X) → [0, 1]. Rather than manipulating the training tuples, this method returns a classification decision based on the output values. In the simplest approach, tuples for which f(X) ≥ t, for some threshold, t, are considered positive, while all other tuples are considered negative. Other approaches may involve manipulating the outputs by weighting. In general, threshold moving moves the threshold, t, so that the rare class tuples are easier to classify (and hence, there is less chance of costly false negative errors). Examples of such classifiers include naïve Bayesian classifiers (Section 8.3) and neural network classifiers like backpropagation (Section 9.2). The threshold-moving method, although not as popular as over- and undersampling, is simple and has shown some success for the two-class-imbalanced data.
Ensemble methods (Sections 8.6.2 through 8.6.4) have also been applied to the class imbalance problem. The individual classifiers making up the ensemble may include versions of the approaches described here such as oversampling and threshold moving.
These methods work relatively well for the class imbalance problem on two-class tasks. Threshold-moving and ensemble methods were empirically observed to outperform oversampling and undersampling. Threshold moving works well even on data sets that are extremely imbalanced. The class imbalance problem on multiclass tasks is much more difficult, where oversampling and threshold moving are less effective. Although threshold-moving and ensemble methods show promise, finding a solution for the multiclass imbalance problem remains an area of future work.
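Three of the class-imbalance approaches in this section (oversampling, undersampling, and threshold moving) can be illustrated with a short sketch. The function names are hypothetical, the positive class is assumed to be the rarer one, and oversampling is done here by plain random replication of positive tuples rather than by generating synthetic tuples as SMOTE does.

```python
import random

def oversample(pos, neg, rng):
    """Replicate randomly chosen positive tuples until both classes are equal in size."""
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return pos + extra, neg

def undersample(pos, neg, rng):
    """Randomly keep only as many negative tuples as there are positive tuples."""
    return pos, rng.sample(neg, len(pos))

def classify_with_threshold(score, t):
    """Threshold moving: a tuple is positive when its score f(X) is at least t."""
    return "positive" if score >= t else "negative"
```

With 100 positive and 1000 negative tuples, as in the oversampling and undersampling example, `oversample` returns 1000 tuples of each class and `undersample` returns 100 of each. Lowering `t` in `classify_with_threshold`, say from 0.5 to 0.25, makes a tuple with score 0.3 count as positive, which is how threshold moving reduces the chance of false negatives for the rare class.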
8.6.3 Boosting and AdaBoost
We now look at the ensemble method of boosting. As in the previous section, suppose that as a patient, you have certain symptoms. Instead of consulting one doctor, you choose to consult several. Suppose you assign weights to the value or worth of each doctor’s diagnosis, based on the accuracies of previous diagnoses they have made. The final diagnosis is then a combination of the weighted diagnoses. This is the essence behind boosting.
In boosting, weights are also assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier, Mi, is learned, the weights are updated to allow the subsequent classifier, Mi+1, to “pay more attention” to the training tuples that were misclassified by Mi. The final boosted classifier, M∗, combines the votes of each individual classifier, where the weight of each classifier’s vote is a function of its accuracy.
AdaBoost (short for Adaptive Boosting) is a popular boosting algorithm. Suppose we want to boost the accuracy of a learning method. We are given D, a data set of d class-labeled tuples, (X1, y1), (X2, y2), ..., (Xd, yd), where yi is the class label of tuple Xi. Initially, AdaBoost assigns each training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a training set, Di, of size d. Sampling

8.7 Summary
Classification is a form of data analysis that extracts models describing data classes. A classifier, or classification model, predicts categorical labels (classes). Numeric prediction models continuous-valued functions. Classification and numeric prediction are the two major types of prediction problems.

Decision tree induction is a top-down recursive tree induction algorithm, which uses an attribute selection measure to select the attribute tested for each nonleaf node in the tree. ID3, C4.5, and CART are examples of such algorithms using different attribute selection measures. Tree pruning algorithms attempt to improve accuracy by removing tree branches reflecting noise in the data. Early decision tree algorithms typically assume that the data are memory resident. Several scalable algorithms, such as RainForest, have been proposed for scalable tree induction.

Naïve Bayesian classification is based on Bayes’ theorem of posterior probability. It assumes class-conditional independence, that is, that the effect of an attribute value on a given class is independent of the values of the other attributes.

Tuple #  Class  Prob.  TP  FP  TN  FN  TPR  FPR
1        P      0.90   1   0   5   4   0.2  0
2        P      0.80   2   0   5   3   0.4  0
3        N      0.70   2   1   4   3   0.4  0.2
4        P      0.60   3   1   4   2   0.6  0.2
5        P      0.55   4   1   4   1   0.8  0.2
6        N      0.54   4   2   3   1   0.8  0.4
7        N      0.53   4   3   2   1   0.8  0.6
8        N      0.51   4   4   1   1   0.8  0.8
9        P      0.50   5   4   1   0   1.0  0.8
10       N      0.40   5   5   0   0   1.0  1.0

Figure 8.18 Tuples sorted by decreasing score, where the score is the value returned by a probabilistic classifier.
386 Chapter 8 Classification: Basic Concepts

A rule-based classifier uses a set of IF-THEN rules for classification. Rules can be
extracted from a decision tree. Rules may also be generated directly from training
data using sequential covering algorithms.
A confusion matrix can be used to evaluate a classifier’s quality. For a two-class
problem, it shows the true positives, true negatives, false positives, and false negatives.
Measures that assess a classifier’s predictive ability include accuracy, sensitivity (also
known as recall), specificity, precision, F, and Fβ. Reliance on the accuracy measure
can be deceiving when the main class of interest is in the minority.
Construction and evaluation of a classifier require partitioning labeled data into
a training set and a test set. Holdout, random sampling, cross-validation, and
bootstrapping are typical methods used for such partitioning.
Significance tests and ROC curves are useful tools for model selection. Significance
tests can be used to assess whether the difference in accuracy between two classifiers
is due to chance. ROC curves plot the true positive rate (or sensitivity) versus the
false positive rate (or 1 − specificity) of one or more classifiers.
Ensemble methods can be used to increase overall accuracy by learning and combining
a series of individual (base) classifier models. Bagging, boosting, and random
forests are popular ensemble methods.
The class imbalance problem occurs when the main class of interest is represented
by only a few tuples. Strategies to address this problem include oversampling,
undersampling, threshold moving, and ensemble techniques.

Decision Tree Example 2: Decision Tree for PlayTennis
Attributes and their values:
 Outlook: Sunny, Overcast, Rain
 Humidity: High, Normal
 Wind: Strong, Weak
 Temperature: Hot, Mild, Cool
 Target concept - Play Tennis: Yes, No

8.8 Exercises

8.1 Briefly outline the major steps of decision tree classification.
8.2 Why is tree pruning useful in decision tree induction? What is a drawback of using a
separate set of tuples to evaluate pruning?
8.3 Given a decision tree, you have the option of (a) converting the decision tree to rules and
then pruning the resulting rules, or (b) pruning the decision tree and then converting
the pruned tree to rules. What advantage does (a) have over (b)?
8.4 It is important to calculate the worst-case computational complexity of the decision tree
algorithm. Given data set, D, the number of attributes, n, and the number of training
tuples, |D|, show that the computational cost of growing a tree is at most n × |D| ×
log(|D|).
8.5 Given a 5-GB data set with 50 attributes (each containing 100 distinct values) and 512
MB of main memory in your laptop, outline an efficient method that constructs deci-
sion trees in such large data sets. Justify your answer by rough calculation of your main
memory usage.

Issues
Given some training examples, what decision tree should be generated?
 Intuition: Prefer the smallest tree that is consistent with the data
   the tree with the least depth?
   the tree with the fewest nodes?
Possible method:
 search the space of decision trees for the smallest decision tree that fits the data

Decision Tree for PlayTennis
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

8.6 Why is naïve Bayesian classification called “naïve”? Briefly outline the major ideas of
naïve Bayesian classification.
8.7 The following table consists of training data from an employee database. The data have
been generalized. For example, “31 . . . 35” for age represents the age range of 31 to 35.
For a given row entry, count represents the number of data tuples having the values for
department, status, age, and salary given in that row.

department  status  age          salary         count
sales       senior  31 . . . 35  46K . . . 50K  30
sales       junior  26 . . . 30  26K . . . 30K  40
sales       junior  31 . . . 35  31K . . . 35K  40
systems     junior  21 . . . 25  46K . . . 50K  20
systems     senior  31 . . . 35  66K . . . 70K  5
systems     junior  26 . . . 30  46K . . . 50K  3
systems     senior  41 . . . 45  66K . . . 70K  3
marketing   senior  36 . . . 40  46K . . . 50K  10
marketing   junior  31 . . . 35  41K . . . 45K  4
secretary   senior  46 . . . 50  36K . . . 40K  4
secretary   junior  26 . . . 30  26K . . . 30K  6

Let status be the class label attribute.
(a) How would you modify the basic decision tree algorithm to take into consideration
the count of each generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.
(c) Given a data tuple having the values “systems,” “26 . . . 30,” and “46–50K” for the
attributes department, age, and salary, respectively, what would a naïve Bayesian
classification of the status for the tuple be?
8.8 RainForest is a scalable algorithm for decision tree induction. Develop a scalable naïve
Bayesian classification algorithm that requires just a single scan of the entire data set
for most databases. Discuss whether such an algorithm can be refined to incorporate
boosting to further enhance its classification accuracy.
8.9 Design an efficient method that performs effective naïve Bayesian classification over
an infinite data stream (i.e., you can scan the data stream only once). If we wanted
to discover the evolution of such classification schemes (e.g., comparing the classification
scheme at this moment with earlier schemes such as one from a week ago), what
modified design would you suggest?
8.10 Show that accuracy is a function of sensitivity and specificity, that is, prove Eq. (8.25).
8.11 The harmonic mean is one of several kinds of averages. Chapter 2 discussed how to
compute the arithmetic mean, which is what most people typically think of when they
compute an average. The harmonic mean, H, of the positive real numbers, x1, x2, . . . , xn,
is defined as

$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}.$$

The F measure is the harmonic mean of precision and recall. Use this fact to derive
Eq. (8.28) for F. In addition, write Fβ as a function of true positives, false negatives, and
false positives.
8.12 The data tuples of Figure 8.25 are sorted by decreasing probability value, as returned by
a classifier. For each tuple, compute the values for the number of true positives (TP),
false positives (FP), true negatives (TN), and false negatives (FN). Compute the true
positive rate (TPR) and false positive rate (FPR). Plot the ROC curve for the data.
8.13 It is difficult to assess classification accuracy when individual data objects may belong to
more than one class at a time. In such cases, comment on what criteria you would use
to compare different classifiers modeled after the same data.
8.14 Suppose that we want to select between two prediction models, M1 and M2. We have
performed 10 rounds of 10-fold cross-validation on each model, where the same data
partitioning in round i is used for both M1 and M2. The error rates obtained for M1 are
30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0. The error rates for M2 are 22.4,
14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0. Comment on whether one model is
significantly better than the other considering a significance level of 1%.
8.15 What is boosting? State why it may improve the accuracy of decision tree induction.

Tuple #  Class  Probability
1        P      0.95
2        N      0.85
3        P      0.78
4        P      0.66
5        N      0.60
6        P      0.55
7        N      0.53
8        N      0.52
9        N      0.51
10       P      0.40

Figure 8.25 Tuples sorted by decreasing score, where the score is the value returned by a
probabilistic classifier.

Example Data
Training Examples:
    Action  Author   Thread  Length  Where
e1  skips   known    new     long    home
e2  reads   unknown  new     short   work
e3  skips   unknown  old     long    work
e4  skips   known    old     long    home
e5  reads   known    new     short   home
e6  skips   known    old     long    work
New Examples:
e7  ???     known    new     short   work
e8  ???     unknown  new     short   work

Decision Tree for PlayTennis
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast
  Rain
 Each internal node tests an attribute
 Each branch corresponds to an attribute value
 Each leaf node assigns a classification

Possible splits
length (skips 9, reads 9)
  long: skips 7, reads 0
  short: skips 2, reads 9
thread (skips 9, reads 9)
  new: skips 3, reads 7
  old: skips 6, reads 2

Decision Tree for PlayTennis
Outlook  Temperature  Humidity  Wind  PlayTennis
Sunny    Hot          High      Weak  ?

Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

Classification: Decision Tree and Its Variants
Ref, Source & Ack:
Jiawei Han, Micheline Kamber, and Jian Pei, University of Illinois at
Urbana-Champaign & Simon Fraser University, USA.

Decision Tree Example 1 / Two Example DTs
Whether to approve a loan
[Figure: loan-approval decision trees with the tests Employed? (Yes/No), Credit Score? (High/Low), and Income? (High/Low), and the leaves Approve/Reject]

Decision Tree
decision trees represent disjunctions of conjunctions
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)
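The equivalence between the PlayTennis tree and its disjunction of conjunctions can be checked with a small sketch. The function names below are illustrative, not from the slides; the tree and the formula are the ones shown above.

```python
from itertools import product

def play_tennis_tree(outlook: str, humidity: str, wind: str) -> bool:
    """Walk the PlayTennis tree from the root test (Outlook) down to a leaf."""
    if outlook == "Sunny":
        return humidity == "Normal"   # Sunny branch tests Humidity
    if outlook == "Overcast":
        return True                   # Overcast is a pure "Yes" leaf
    if outlook == "Rain":
        return wind == "Weak"         # Rain branch tests Wind
    raise ValueError(f"unknown outlook: {outlook}")

def play_tennis_dnf(outlook: str, humidity: str, wind: str) -> bool:
    """The same concept written as a disjunction of conjunctions."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

# Every path to a "Yes" leaf is one conjunction, so both forms agree everywhere.
ALL_INPUTS = list(product(["Sunny", "Overcast", "Rain"],
                          ["High", "Normal"],
                          ["Strong", "Weak"]))
AGREE = all(play_tennis_tree(*x) == play_tennis_dnf(*x) for x in ALL_INPUTS)
```

Temperature does not appear because the tree never tests it.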
Choices
 Attribute Selection Method: Splitting Criteria
   Determine best way to split/partition D into individual classes
   Partitions at each branch should be as "pure" as possible
   Pure: all tuples in it belong to the same class
 Conditions for stopping partitioning
   All samples for a given node belong to the same class
   There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
   There are no samples left

Brief Review of Entropy
[Figure: entropy, m = 2]

Decision Tree: Starting & Algo
• In the early 1980s, J. Ross Quinlan developed a decision tree algo called ID3 (Iterative Dichotomiser)
• Successor of ID3 => C4.5 (became a benchmark)
• 1984: book titled “Classification and Regression Trees” (CART)
• ID3 and CART are similar, but were developed independently
• ID3, C4.5, and CART adopt a greedy approach and construct the tree in a top-down recursive divide-and-conquer manner
• ID3 algo: uses 3 parameters: D (data partition), attribute_list, and attribute_selection_method (heuristic procedure to select the "best" discriminating attribute)

Attribute Selection Measure: Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
 Information needed (after using A to split D into v partitions) to classify D:
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
 Information gained by branching on attribute A:
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$

Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
   Tree is constructed in a top-down recursive divide-and-conquer manner
   At start, all the training examples are at the root
   Attributes are categorical (if continuous-valued, they are discretized in advance)
   Examples are partitioned recursively based on selected attributes
   Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Decision Tree Induction: An Example: Buy Computer
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Attribute Selection: Information Gain
 Class P: buys_computer = “yes”
 Class N: buys_computer = “no”
$$\mathrm{Info}(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$
 Now compute Info_age(D)
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

$$\mathrm{Info}_{age}(D) = \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

Basic Approach at a particular node
1. A ← the “best” decision attribute for next node
2. Assign A as decision attribute for node
3. For each value of A create new descendant
4. Sort training examples to leaf node according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of target attribute) stop, else iterate over new leaf nodes.

Decision Tree Induction: An Example
 Training data set: buys_computer
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)
 Resulting tree:
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes

Attribute Selection: Information Gain
age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

 The term (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. So
$$I(2,3) = -\frac{2}{5}\log_2\left(\frac{2}{5}\right) - \frac{3}{5}\log_2\left(\frac{3}{5}\right)$$
$$\mathrm{Gain}(age) = \mathrm{Info}(D) - \mathrm{Info}_{age}(D) = 0.246$$
 Similarly,
$$\mathrm{Gain}(income) = 0.029, \quad \mathrm{Gain}(student) = 0.151, \quad \mathrm{Gain}(credit\_rating) = 0.048$$
 Age has the highest information gain, hence it becomes the splitting attribute at the root node
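The gain values quoted above can be reproduced from the 14-tuple buys_computer table. The following is a minimal sketch; the helper names (`info`, `info_after_split`, `gain`) are illustrative choices, not from the slides, and the age range is written as `31..40` so each row splits on whitespace.

```python
from collections import Counter
from math import log2

# Columns: age, income, student, credit_rating, buys_computer
ROWS = """
<=30   high   no  fair      no
<=30   high   no  excellent no
31..40 high   no  fair      yes
>40    medium no  fair      yes
>40    low    yes fair      yes
>40    low    yes excellent no
31..40 low    yes excellent yes
<=30   medium no  fair      no
<=30   low    yes fair      yes
>40    medium yes fair      yes
<=30   medium yes excellent yes
31..40 medium no  excellent yes
31..40 high   yes fair      yes
>40    medium no  excellent no
"""
DATA = [tuple(line.split()) for line in ROWS.strip().splitlines()]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    """Entropy Info(D) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_after_split(rows, col):
    """Info_A(D): weighted entropy of the partitions induced by column col."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[col], []).append(row[-1])
    return sum(len(p) / len(rows) * info(p) for p in partitions.values())

def gain(rows, col):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([row[-1] for row in rows]) - info_after_split(rows, col)

GAINS = {name: gain(DATA, col) for name, col in ATTRS.items()}
BEST = max(GAINS, key=GAINS.get)  # attribute chosen for the root split
```

Unrounded, the gains come out slightly above the slide's truncated figures (for example, about 0.2467 for age), which does not change the ranking.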

Attribute Selection: Crucial but How?
 ID3/C4.5: used information gain as the criterion
 Information Gain: based on the concept of Entropy
*For detailed and complete algo, refer to Chapter 8 of the book "Data Mining: Concepts and Techniques" by Han, Kamber & Pei.

Computing Information-Gain for Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
   Sort the values of A in increasing order
   Typically, the midpoint between each pair of adjacent values is considered as a possible split point
     (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
   The point with the minimum expected information requirement for A is selected as the split-point for A
 Split:
   D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
Computation of Gini Index
 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
$$\mathrm{gini}(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$
 Now compute the Gini index for each attribute. For each attribute, multiple partitions are possible depending on the splitting subset, and the Gini index must be computed for each
 Consider the attribute income. A binary split/partition on income is made (the Gini index always uses binary splits), generating two partitions D1 and D2
 What are the possibilities for D1 and D2?
 Every possible subset, i.e., every member of the power set except the empty and full subsets (2^v − 2 subsets, if attribute A has v possible values)
 For example, income has 3 possible values: {low, medium, high}
 Compute the Gini index for each possible partition and select the one with the minimum Gini index.

Tree Pruning
 Cost complexity algo for CART:
   Example of post-pruning
   Cost complexity as a function of no. of leaves & error rate
   Error rate: % of tuples misclassified by the tree
Method:
   starts from the bottom of the tree.
   For each internal node, N, computes the cost complexity of the subtree at N, and the cost complexity of the subtree at N if it were to be pruned (i.e., replaced by a leaf node).
   The two values are compared. If pruning the subtree at node N would result in a smaller cost complexity, then the subtree is pruned. Otherwise, it is kept.

Gain Ratio for Attribute Selection (C4.5)
 Information gain measure is biased towards attributes with a large number of values (prefers to select attributes having a large no. of values)
   For ex., product_id ranging 1..n => n partitions, and each will be pure
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)
$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
 Compare it with the Info_A computation
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j), \qquad \mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Tree Pruning
Pessimistic Pruning (used by C4.5):
 Similar to the cost complexity method, it also uses error rate estimates to make decisions
 But it does not require the use of a prune set.
 Instead, it uses the training set to estimate error rates.
 An estimate of accuracy or error based on the training set is overly optimistic and, therefore, strongly biased.
 This method therefore adjusts the error rates obtained from the training set by adding a penalty, so as to counter the bias incurred

Computation of Gini Index
 Suppose the attribute income partitions D into 10 in D1: {low, medium} and 4 in D2: {high}
$$\mathrm{gini}_{income \in \{low, medium\}}(D) = \frac{10}{14}\,\mathrm{gini}(D_1) + \frac{4}{14}\,\mathrm{gini}(D_2) = 0.443$$
 Gini{low,high} is 0.458; Gini{medium,high} is 0.450. Thus, split on {low,medium} (and {high}) since it has the lowest Gini index
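The enumeration of binary splits on income can be sketched as follows. The per-value (yes, no) counts are tallied from the 14-tuple buys_computer table; the names `INCOME_COUNTS` and `gini_split` are illustrative, not from the slides.

```python
from itertools import combinations

# income value -> (tuples with buys_computer = yes, tuples with buys_computer = no)
INCOME_COUNTS = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}

def gini(yes, no):
    """gini(D) = 1 - sum of squared class frequencies, for two classes."""
    total = yes + no
    return 1.0 - (yes / total) ** 2 - (no / total) ** 2

def gini_split(counts, subset):
    """Weighted Gini index of the binary split (values in subset) vs (the rest)."""
    y1 = sum(y for v, (y, n) in counts.items() if v in subset)
    n1 = sum(n for v, (y, n) in counts.items() if v in subset)
    y2 = sum(y for v, (y, n) in counts.items() if v not in subset)
    n2 = sum(n for v, (y, n) in counts.items() if v not in subset)
    total = y1 + n1 + y2 + n2
    return (y1 + n1) / total * gini(y1, n1) + (y2 + n2) / total * gini(y2, n2)

# All 2^v - 2 proper, non-empty subsets; a subset and its complement describe
# the same partition, so each Gini value appears twice.
VALUES = sorted(INCOME_COUNTS)
SPLITS = {subset: gini_split(INCOME_COUNTS, set(subset))
          for size in range(1, len(VALUES))
          for subset in combinations(VALUES, size)}
BEST_SPLIT = min(SPLITS, key=SPLITS.get)
```

The minimum is reached by {low, medium} versus {high}, in agreement with the slide.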

Comparing Attribute Selection Measures
 The three measures, in general, return good results but
   Information gain:
     biased towards multivalued attributes
   Gain ratio:
     tends to prefer unbalanced splits in which one partition is much smaller than the others
   Gini index:
     biased to multivalued attributes
     has difficulty when # of classes is large
     tends to favor tests that result in equal-sized partitions and purity in both partitions

Tree Pruning: Repetition and replication
 Pruned trees tend to be more compact than their unpruned counterparts, but they may still be rather large and complex. Decision trees can suffer from repetition and replication (Figure 8.7).
 Repetition occurs when an attribute is repeatedly tested along a given branch of the tree (e.g., “age < 60?,” followed by “age < 45?,” and so on).
 In replication, duplicate subtrees exist within the tree.

Gain Ratio (Computation)
 SplitInfo_age(D) = −(5/14) log2(5/14) − (4/14) log2(4/14) − (5/14) log2(5/14)
 Recall that Info_age(D) was calculated as
(5/14)[−(2/5) log2(2/5) − (3/5) log2(3/5)] + (4/14)[−(4/4) log2(4/4) − 0] + (5/14)[−(2/5) log2(2/5) − (3/5) log2(3/5)]
 SplitInfo_age(D) comes to 1.577
 gain_ratio(age) = Gain(age)/SplitInfo_age(D) = 0.246/1.577 = 0.156
 gain_ratio(income) = 0.029/1.557 = 0.019
 gain_ratio(student) = ???  gain_ratio(credit_rating) = ???
 The attribute with the maximum gain ratio is selected as the splitting attribute
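A short sketch of the gain ratio arithmetic for age and income follows. The partition sizes are read off the 14-tuple table (age: 5/4/5, income: 4/6/4) and the gains are the rounded values quoted on the slides; the function names are illustrative.

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) computed from the sizes |D_j| of the v partitions."""
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(partition_sizes)

AGE_RATIO = gain_ratio(0.246, [5, 4, 5])      # <=30, 31..40, >40
INCOME_RATIO = gain_ratio(0.029, [4, 6, 4])   # low, medium, high
```

Note that SplitInfo depends only on the partition sizes, not on the class labels inside each partition.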

Gain Ratio for Attribute Selection (C4.5)
$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
 SplitInfo represents the potential information generated by splitting into v partitions, corresponding to v outcomes on attribute A
 For each outcome, it considers the no. of tuples having that outcome with respect to the total no. of tuples in D, whereas information gain measures the information with respect to classification, based on the same partitioning

Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm, measure based on χ2 test for independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistic: has a close approximation to χ2 distribution
 MDL (Minimum Description Length) principle (i.e., the simplest solution is preferred):
   The best tree is the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
   CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
   Most give good results, none is significantly superior to the others

Gini Index (CART, IBM IntelligentMiner)
 If a data set D contains examples from n classes, the gini index, gini(D), is defined as
$$\mathrm{gini}(D) = 1 - \sum_{j=1}^{n} p_j^2$$
where p_j is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
$$\mathrm{gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{gini}(D_2)$$
 Reduction in impurity:
$$\Delta\mathrm{gini}(A) = \mathrm{gini}(D) - \mathrm{gini}_A(D)$$
 The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
   Too many branches, some may reflect anomalies due to noise or outliers
   Poor accuracy for unseen samples
 Two approaches to avoid overfitting
   Prepruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold
     Difficult to choose an appropriate threshold
   Postpruning: Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
     Use a set of data different from the training data to decide which is the “best pruned tree”

Tree Pruning: Repetition and replication
 These situations can impede the accuracy and comprehensibility of a decision tree.
 The use of multivariate splits (splits based on a combination of attributes) can prevent these problems.
 Another approach is to use a different form of knowledge representation, such as rules, instead of decision trees.
Enhancements to Basic Decision Tree Induction
 Allow for continuous-valued attributes
   Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
 Handle missing attribute values
   Assign the most common value of the attribute
   Assign probability to each of the possible values
 Attribute construction
   Create new attributes based on existing ones that are sparsely represented
   This reduces fragmentation, repetition, and replication

Scalability: BOAT
 BOAT (Bootstrapped Optimistic Algorithm for Tree construction) is a decision tree algo with a different approach to scalability.
 Uses a statistical technique known as “bootstrapping” to create several smaller subsets, each of which fits in memory.
 Each subset is used to construct a tree, resulting in several trees.
 The trees are examined and used to construct a new tree T’, “very close” to the tree that would have been generated for the whole dataset

Bayes’ Theorem: Example
 A man is known to speak the truth 2 out of 3 times. He throws a die and reports that the number obtained is a four. Find the probability that the number obtained is actually a four.
 Let A be the event: man reports that number is four.
 Let E1 be the event that four is obtained, and E2 its complement E1’
 Then, P(E1) = Probability that four occurs = 1/6.
 P(E2) = Prob. that four does not occur = 1 − P(E1) = 5/6.
 P(A|E1) = Probability that the man reports four given that it is actually a four = 2/3; P(A|E2) = 1/3.
 Bayes’ theorem: prob. that number obtained is actually a four, P(E1|A) = (1/6 × 2/3)/(1/6 × 2/3 + 5/6 × 1/3) = 2/7
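The die example can be checked with exact rational arithmetic. This is a minimal sketch; the function name `posterior` and its parameter names are illustrative.

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_given_complement):
    """P(E1|A) by Bayes' theorem for an event E1 and its complement E2."""
    # Total probability theorem: P(A) = P(E1) P(A|E1) + P(E2) P(A|E2)
    evidence = prior * likelihood + (1 - prior) * likelihood_given_complement
    return prior * likelihood / evidence

# P(E1) = 1/6, P(A|E1) = 2/3 (he tells the truth), P(A|E2) = 1/3 (he lies)
P_FOUR_GIVEN_REPORT = posterior(Fraction(1, 6), Fraction(2, 3), Fraction(1, 3))
```

Using `Fraction` keeps the result as the exact value 2/7 rather than a rounded float.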

Scalability: BOAT
 BOAT can use any attribute selection measure that selects binary splits and is based on the notion of purity of partitions
 BOAT uses a lower bound on the attribute selection measure to detect if this “very good” tree T’ is different from the real tree, T. It refines T’ to arrive at T.
 BOAT usually requires only two scans of D. This is quite an improvement, even in comparison to traditional decision tree algorithms (the basic algo requires one scan per tree level)
 BOAT was found to be two to three times faster than RainForest
 Additional adv.: can be used for incremental updates

Prediction Based on Bayes’ Theorem
 Given training data X, posteriori probability of a hypothesis H, P(H|X), follows the Bayes’ theorem
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
 Informally, this can be viewed as
posteriori = likelihood × prior / evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
 Practical difficulty: It requires initial knowledge of many probabilities, involving significant computational cost

Classification in Large Databases
 Classification: a classical problem extensively studied by statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
 Why is decision tree induction popular?
   relatively faster learning speed (than other classification methods)
   convertible to simple and easy to understand classification rules
   can use SQL queries for accessing databases
   comparable classification accuracy with other methods
 RainForest (VLDB’98; Gehrke, Ramakrishnan & Ganti)
   Builds an AVC-list (attribute, value, class label)

Chapter 8. Classification: Basic Concepts
 Bayes Classification Methods

Classification to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem
$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$$
 Since P(X) is constant for all classes, only
$$P(X|C_i)\,P(C_i)$$
needs to be maximized

Scalability Framework for RainForest
 Separates the scalability aspects from the criteria that determine the quality of the tree
 Builds an AVC-list: AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X)
   Projection of training dataset onto the attribute X and class label where counts of individual class label are aggregated
 AVC-group (of a node n)
   Set of AVC-sets of all predictor attributes at the node n

Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
 Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$$
 This greatly reduces the computation cost: Only counts the class distribution
 If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k for A_k divided by |C_i, D| (# of tuples of C_i in D)
 If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with a mean μ and standard deviation σ
$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
and P(x_k|C_i) is
$$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$

Scalability: AVC Set

Bayes’ Theorem: Basics
 Total probability theorem:
$$P(B) = \sum_{i=1}^{M} P(B|A_i)\,P(A_i)$$
 Bayes’ Theorem:
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X) (i.e., posteriori probability): the probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
   E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
   E.g., Given that X will buy computer, the prob. that X is 31..40, medium income

Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data to be classified:
X = (age <=30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Decision Tree: Implementations
• Let us refer to the book's chapter and related slides for the decision tree implementation

Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:
Actual class \ Predicted class    C1                      ¬C1
C1                                True Positives (TP)     False Negatives (FN)
¬C1                               False Positives (FP)    True Negatives (TN)

Example of Confusion Matrix:
Actual class \ Predicted class    buy_computer = yes    buy_computer = no    Total
buy_computer = yes                6954                  46                   7000
buy_computer = no                 412                   2588                 3000
Total                             7366                  2634                 10000

• Given m classes, an entry CM(i,j) in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals

Naïve Bayes Classifier: An Example

age       income    student    credit_rating    buys_computer
<=30      high      no         fair             no
<=30      high      no         excellent        no
31…40     high      no         fair             yes
>40       medium    no         fair             yes
>40       low       yes        fair             yes
>40       low       yes        excellent        no
31…40     low       yes        excellent        yes
<=30      medium    no         fair             no
<=30      low       yes        fair             yes
>40       medium    yes        fair             yes
<=30      medium    yes        excellent        yes
31…40     medium    no         excellent        yes
31…40     high      yes        fair             yes
>40       medium    no         excellent        no

• P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357
• Compute P(X|Ci) for each class
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X|Ci):
    P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
    P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
  P(X|Ci) * P(Ci):
    P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
    P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
  Therefore, X belongs to class ("buys_computer = yes")
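The arithmetic of the Naïve Bayes example can be checked with a short script. This is a minimal sketch that counts relative frequencies directly from the 14 training tuples; the names `DATA` and `naive_bayes_scores` are illustrative and not part of the slides.

```python
from collections import Counter

# buys_computer training tuples from the slide:
# (age, income, student, credit_rating, buys_computer)
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def naive_bayes_scores(data, x):
    """Return {class: P(X|Ci) * P(Ci)} using raw relative frequencies."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(data)  # prior P(Ci)
        rows = [row for row in data if row[-1] == label]
        for k, value in enumerate(x):
            matches = sum(1 for row in rows if row[k] == value)
            score *= matches / n_label  # conditional P(x_k | Ci)
        scores[label] = score
    return scores


X = ("<=30", "medium", "yes", "fair")
scores = naive_bayes_scores(DATA, X)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

No smoothing is applied here, so a single zero count would zero out the whole product for that class.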

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A\P    C     ¬C
C      TP    FN    P
¬C     FP    TN    N
       P'    N'    All

• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN)/All
• Error/misclassification rate: 1 - accuracy, or
  Error rate = (FP + FN)/All
• Class Imbalance Problem:
  - One class may be rare, e.g., fraud, or HIV-positive
  - Significant majority of the negative class and minority of the positive class
  - Sensitivity: true positive recognition rate
    Sensitivity = TP/P
  - Specificity: true negative recognition rate
    Specificity = TN/N

Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero. Otherwise, the predicted probability will be zero

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

• Ex. Suppose a dataset with 1000 tuples, income = low (0), income = medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
  - Adding 1 to each case
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  - The "corrected" probability estimates are close to their "uncorrected" counterparts
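The Laplacian correction for the income example can be reproduced exactly with rational arithmetic. This is a small sketch; the function name `laplace_estimates` is an illustrative choice.

```python
from fractions import Fraction


def laplace_estimates(counts):
    """Add 1 to every count, then renormalise (Laplacian correction)."""
    total = sum(counts.values()) + len(counts)  # 1000 + 3 = 1003 in the example
    return {value: Fraction(c + 1, total) for value, c in counts.items()}


income_counts = {"low": 0, "medium": 990, "high": 10}
corrected = laplace_estimates(income_counts)
for value, p in corrected.items():
    print(value, p, round(float(p), 4))
```

The corrected estimate for income = medium (991/1003) differs from the uncorrected 990/1000 only in the third decimal place, while income = low is no longer zero.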

Naïve Bayes Classifier: Comments
• Advantages
  - Easy to implement
  - Good results obtained in most of the cases
• Disadvantages
  - Assumption: class conditional independence, therefore loss of accuracy
  - Practically, dependencies exist among variables
    E.g., hospitals: patients: Profile: age, family history, etc.; Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.
  - Dependencies among these cannot be modeled by Naïve Bayes Classifier
• How to deal with these dependencies? Bayesian Belief Networks

Chapter 8. Classification: Basic Concepts
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary

Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive
• Recall (completeness): what % of positive tuples did the classifier label as positive?
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and recall
• Fß: weighted measure of precision and recall
  - assigns ß times as much weight to recall as to precision
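The slide names the F measures without showing formulas. The sketch below assumes the standard definitions, F1 = 2PR/(P + R) and Fß = (1 + ß²)PR/(ß²P + R), where P is precision and R is recall; the function name `f_beta` is illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the F1 score; beta > 1 weights recall more heavily.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + b2) * precision * recall / denom


# Precision and recall taken from a confusion matrix with TP=90, FP=140, FN=210
precision = 90 / 230
recall = 90 / 300
f1 = f_beta(precision, recall)
f2 = f_beta(precision, recall, beta=2.0)
print(round(f1, 4), round(f2, 4))
```

Because recall is lower than precision in this case, F2 (which favours recall) comes out below F1.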

Chapter 8. Classification: Basic Concepts
• ….
• Model Evaluation and Selection

Ensemble Methods: Increasing the Accuracy
• Ensemble methods
  - Use a combination of models to increase accuracy
  - Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
• Popular ensemble methods
  - Bagging: averaging the prediction over a collection of classifiers
  - Boosting: weighted vote with a collection of classifiers
  - Stacking: combining a set of heterogeneous classifiers

Classifier Evaluation Metrics: Example

Actual class \ Predicted class    cancer = yes    cancer = no    Total    Recognition (%)
cancer = yes                      90              210            300      30.00 (sensitivity)
cancer = no                       140             9560           9700     98.56 (specificity)
Total                             230             9770           10000    96.50 (accuracy)

• Precision = 90/230 = 39.13%
• Recall = 90/300 = 30.00%
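The figures in the cancer example follow directly from the four confusion-matrix counts. This is a minimal sketch; the function name `evaluation_metrics` and its dictionary keys are illustrative.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Compute the basic metrics from a 2x2 confusion matrix."""
    p = tp + fn          # actual positives
    n = fp + tn          # actual negatives
    total = p + n
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / p,       # same as recall
        "specificity": tn / n,
        "precision": tp / (tp + fp),
    }


# Counts from the cancer example: TP=90, FN=210, FP=140, TN=9560
metrics = evaluation_metrics(tp=90, fn=210, fp=140, tn=9560)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

The high accuracy here comes almost entirely from the negative class; the 30% sensitivity shows why accuracy alone is misleading on class-imbalanced data.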

Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics to consider?
• Use validation test set of class-labeled tuples instead of training set when assessing accuracy
• Methods for estimating a classifier's accuracy:
  - Holdout method, random subsampling
  - Cross-validation
  - Bootstrap
• Comparing classifiers:
  - Confidence intervals
  - Cost-benefit analysis and ROC Curves

Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors' majority vote
• Training
  - Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
  - A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
  - Each classifier Mi returns its class prediction
  - The bagged classifier M* counts the votes and assigns the class with the most votes to X
• Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
• Accuracy
  - Often significantly better than a single classifier derived from D
  - For noisy data: not considerably worse, more robust
  - Proved improved accuracy in prediction

Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
• Holdout method
  - Given data is randomly partitioned into two independent sets
    Training set (e.g., 2/3) for model construction
    Test set (e.g., 1/3) for accuracy estimation
  - Random sampling: a variation of holdout
    Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
  - Randomly partition the data into k mutually exclusive subsets, each approximately equal size
  - At i-th iteration, use Di as test set and others as training set
  - Leave-one-out: k folds where k = # of tuples, for small sized data
  - *Stratified cross-validation*: folds are stratified so that class dist. in each fold is approx. the same as that in the initial data
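The k-fold partitioning step can be sketched in a few lines. This version is unstratified and uses only the standard library; the function names `k_fold_indices` and `train_test_splits` are illustrative.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds
    whose sizes differ by at most one."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n, k, seed=0):
    """Yield (train, test) index lists; fold i is the test set at iteration i."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


folds = k_fold_indices(14, 5)
for train, test in train_test_splits(14, 5):
    print(len(train), len(test))
```

Setting k equal to the number of tuples gives leave-one-out. A stratified variant would partition each class separately before combining the folds.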
Boosting
• Analogy: Consult several doctors, based on a combination of weighted diagnoses, with each weight assigned based on the previous diagnosis accuracy
• How does boosting work?
  - Weights are assigned to each training tuple
  - A series of k classifiers is iteratively learned
  - After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
  - The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data

Chapter 8. Classification: Basic Concepts
• Summary

Estimating Confidence Intervals: Null Hypothesis
• Perform 10-fold cross-validation
• Assume samples follow a t distribution with k-1 degrees of freedom (here, k = 10)
• Use t-test
• Null Hypothesis: M1 & M2 are the same
• If we can reject the null hypothesis, then
  - we conclude that the difference between M1 & M2 is statistically significant
  - choose the model with the lower error rate

Summary (I)
• Classification is a form of data analysis that extracts models describing important data classes.
• Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, rule-based classification, and many other classification methods.
• Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fß measure.
• Stratified k-fold cross-validation is recommended for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.

Estimating Confidence Intervals: t-test
• If only 1 test set available: pairwise comparison
  - For the ith round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)i and err(M2)i
  - Average over 10 rounds to get the mean error rates of M1 and M2
  - t-test computes the t-statistic with k-1 degrees of freedom
• If two test sets available: use non-paired (two-sample) t-test, where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.

Adaboost (Adaptive Boosting)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
  - Tuples from D are sampled (with replacement) to form a training set Di of the same size
  - Each tuple's chance of being selected is based on its weight
  - A classification model Mi is derived from Di
  - Its error rate is calculated using Di as a test set
  - If a tuple is misclassified, its weight is increased; otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi error rate is the sum of the weights of the misclassified tuples:

$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$$

• The weight of classifier Mi's vote is

$$\log \frac{1 - error(M_i)}{error(M_i)}$$
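One round of the weight bookkeeping can be sketched as follows. The error rate and vote weight follow the two formulas on the slide. The specific update rule (multiply the weights of correctly classified tuples by error/(1 - error), then renormalise) is an assumption, since the slide only says that weights are increased or decreased. The function name `adaboost_round` is illustrative.

```python
import math


def adaboost_round(weights, misclassified):
    """Return (error, vote_weight, new_weights) for one boosting round.

    weights        -- current tuple weights, summing to 1
    misclassified  -- parallel list of booleans, True where Mi got the tuple wrong
    """
    # error(Mi) = sum of the weights of the misclassified tuples
    error = sum(w for w, bad in zip(weights, misclassified) if bad)
    if not 0 < error < 0.5:
        raise ValueError("this sketch expects 0 < error(Mi) < 0.5")

    # weight of the classifier's vote: log((1 - error) / error)
    vote_weight = math.log((1 - error) / error)

    # Assumed update: shrink weights of correctly classified tuples, then
    # normalise so that misclassified tuples gain relative weight.
    shrink = error / (1 - error)
    updated = [w if bad else w * shrink for w, bad in zip(weights, misclassified)]
    total = sum(updated)
    return error, vote_weight, [w / total for w in updated]


weights = [0.2] * 5
error, vote_weight, new_weights = adaboost_round(
    weights, [True, False, False, False, False]
)
print(error, round(vote_weight, 3), new_weights)
```

After the round, the single misclassified tuple carries half of the total weight, so the next classifier is far more likely to sample it.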

Summary (II)
• Significance tests and ROC curves are useful for model selection.
• There have been numerous comparisons of the different classification methods; the matter remains a research topic
• No single method has been found to be superior over all others for all data sets
• Issues such as accuracy, training time, robustness, scalability, and interpretability must be considered and can involve trade-offs, further complicating the quest for an overall superior method

Estimating Confidence Intervals: Table for t-distribution
• Symmetric
• Significance level, e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of population
• Confidence limit, z = sig/2

One Common Name used: .632 Bootstrap
• Several bootstrap methods, and a common one is .632 bootstrap
  - A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^(-1) = 0.368)
  - Repeat the sampling procedure k times and combine the results into the overall accuracy of the model
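The 63.2% figure is easy to confirm empirically. The sketch below draws one bootstrap sample and measures how many distinct tuples it contains. The helper `acc_632` assumes the usual .632 weighting, averaging 0.632 × (test-set accuracy) + 0.368 × (training-set accuracy) over the k repetitions; the slide's own formula did not survive extraction, so treat that combination as an assumption. Both function names are illustrative.

```python
import random


def bootstrap_split(d, rng):
    """Sample d indices with replacement; unsampled indices form the test set."""
    train = [rng.randrange(d) for _ in range(d)]
    chosen = set(train)
    test = [i for i in range(d) if i not in chosen]
    return train, test


def acc_632(pairs):
    """Average of 0.632 * test accuracy + 0.368 * train accuracy.

    pairs -- list of (test_accuracy, train_accuracy), one per repetition
    """
    return sum(0.632 * t + 0.368 * tr for t, tr in pairs) / len(pairs)


rng = random.Random(0)
train, test = bootstrap_split(10_000, rng)
fraction_in_bootstrap = len(set(train)) / 10_000
print(round(fraction_in_bootstrap, 3), round(len(test) / 10_000, 3))
```

With d = 10,000 the fraction of distinct tuples in the bootstrap lands close to 0.632, and the held-out fraction close to 0.368.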

Random Forest (Breiman 2001)
• Random Forest:
  - Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of attributes at each node to determine the split
  - During classification, each tree votes and the most popular class is returned
• Two Methods to construct Random Forest:
  - Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size
  - Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting

Some More Classifiers to be studied in Coming PPTs

Estimating Confidence Intervals: Statistical Significance
• Are M1 & M2 significantly different?
  - Compute t. Select significance level (e.g., sig = 5%)
  - Consult table for t-distribution: Find t value corresponding to k-1 degrees of freedom (here, 9)
  - t-distribution is symmetric: typically upper % points of distribution shown → look up value for confidence limit z = sig/2 (here, 0.025)
  - If t > z or t < -z, then t value lies in rejection region:
    Reject null hypothesis (null hypothesis was: mean error rates of M1 & M2 are the same)
    Conclude: statistically significant difference between M1 & M2
  - Otherwise, conclude that any difference is chance
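The decision procedure above can be sketched as a paired t-test on per-fold error rates. The statistic used here is the standard paired form, mean difference divided by (sample standard deviation / sqrt(k)), with a k-1 denominator in the variance. The slide's own formula was not recoverable and may divide by k instead, so this is an assumption. The critical value 2.262 is the usual table entry for 9 degrees of freedom at the upper 2.5% point. Names such as `paired_t` are illustrative.

```python
import math
import statistics

# Two-sided 5% critical value (upper 2.5% point) for 9 degrees of freedom
T_CRIT_9DF = 2.262


def paired_t(err1, err2):
    """t-statistic for paired per-fold error rates of two models."""
    diffs = [a - b for a, b in zip(err1, err2)]
    k = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (k-1 denominator)
    if sd == 0:
        return 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    return mean_diff / (sd / math.sqrt(k))


def significantly_different(err1, err2, critical=T_CRIT_9DF):
    """Reject the null hypothesis when t falls in either rejection region."""
    return abs(paired_t(err1, err2)) > critical


# Hypothetical per-fold error rates from the same 10-fold partitioning
err_m1 = [0.20, 0.22, 0.21, 0.19, 0.23, 0.20, 0.22, 0.21, 0.19, 0.23]
err_m2 = [0.10, 0.11, 0.12, 0.10, 0.11, 0.12, 0.10, 0.11, 0.12, 0.10]
print(paired_t(err_m1, err_m2), significantly_different(err_m1, err_m2))
```

Here M2 is better on every fold by a wide margin, so the null hypothesis is rejected and the model with the lower mean error rate would be chosen.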

Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: Rare positive example but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification:
  - Oversampling: re-sampling of data from positive class
  - Under-sampling: randomly eliminate tuples from negative class
  - Threshold-moving: moves the decision threshold, t, so that the rare class tuples are easier to classify, and hence, less chance of costly false negative errors
  - Ensemble techniques: Ensemble multiple classifiers introduced above
• Still difficult for class imbalance problem on multiclass tasks

Estimating Confidence Intervals: Classifier Models M1 vs. M2
• Suppose we have 2 classifiers, M1 and M2, which one is better?
• Use 10-fold cross-validation to obtain the mean error rates of M1 and M2
• These mean error rates are just estimates of error on the true population of future data cases
• What if the difference between the 2 error rates is just attributed to chance?
  - Use a test of statistical significance
  - Obtain confidence limits for our error estimates

Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
• Vertical axis represents the true positive rate
• Horizontal axis represents the false positive rate
• The plot also shows a diagonal line
• A model with perfect accuracy will have an area of 1.0
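The ranking-based construction of an ROC curve can be sketched as follows. Tuples are sorted by decreasing score, the curve steps up for each positive and right for each negative, and the area is computed with the trapezoid rule. The sketch assumes distinct scores and 0/1 labels; the names `roc_points` and `auc` are illustrative.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points of the ROC curve, starting at (0, 0)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1  # step up: one more true positive
        else:
            fp += 1  # step right: one more false positive
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = positive) with classifier scores
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
print(roc_points(labels, scores), auc(roc_points(labels, scores)))
```

A ranking that places every positive above every negative gives an area of 1.0, while a random ranking hovers around the diagonal's 0.5.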
Issues Affecting Model Selection
• Accuracy
  - classifier accuracy: predicting class label
• Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
  - understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

412 Chapter 9 Classification: Advanced Methods

“So, how does an SVM find the MMH and the support vectors?” Using some “fancy math tricks,” we can rewrite Eq. (9.18) so that it becomes what is known as a constrained (convex) quadratic optimization problem. Such fancy math tricks are beyond the scope of this book. Advanced readers may be interested to note that the tricks involve rewriting Eq. (9.18) using a Lagrangian formulation and then solving for the solution using Karush-Kuhn-Tucker (KKT) conditions. Details can be found in the bibliographic notes at the end of this chapter (Section 9.10).

If the data are small (say, less than 2000 training tuples), any optimization software package for solving constrained convex quadratic problems can then be used to find the support vectors and MMH. For larger data, special and more efficient algorithms for training SVMs can be used instead, the details of which exceed the scope of this book. Once we’ve found the support vectors and MMH (note that the support vectors define the MMH!), we have a trained support vector machine. The MMH is a linear class boundary, and so the corresponding SVM can be used to classify linearly separable data. We refer to such a trained SVM as a linear SVM.

“Once I’ve got a trained support vector machine, how do I use it to classify test (i.e., new) tuples?” Based on the Lagrangian formulation mentioned before, the MMH can be rewritten as the decision boundary

$$d(X^T) = \sum_{i=1}^{l} y_i \alpha_i X_i X^T + b_0, \qquad (9.19)$$

where y_i is the class label of support vector X_i; X^T is a test tuple; α_i and b_0 are numeric parameters that were determined automatically by the optimization or SVM algorithm noted before; and l is the number of support vectors.

Interested readers may note that the α_i are Lagrangian multipliers. For linearly separable data, the support vectors are a subset of the actual training tuples (although there will be a slight twist regarding this when dealing with nonlinearly separable data, as we shall see in the following).

Given a test tuple, X^T, we plug it into Eq. (9.19), and then check to see the sign of the result. This tells us on which side of the hyperplane the test tuple falls. If the sign is positive, then X^T falls on or above the MMH, and so the SVM predicts that X^T belongs to class +1 (representing buys_computer = yes, in our case). If the sign is negative, then X^T falls on or below the MMH and the class prediction is −1 (representing buys_computer = no).

Notice that the Lagrangian formulation of our problem (Eq. 9.19) contains a dot product between support vector X_i and test tuple X^T. This will prove very useful for finding the MMH and support vectors for the case when the given data are nonlinearly separable, as described further in the next section.

Before we move on to the nonlinear case, there are two more important things to note. The complexity of the learned classifier is characterized by the number of support vectors rather than the dimensionality of the data. Hence, SVMs tend to be less prone to overfitting than some other methods. The support vectors are the essential or critical training tuples: they lie closest to the decision boundary (MMH). If all other training

9.5 Lazy Learners (or Learning from Your Neighbors) 423

examples of eager learners. Eager learners, when given a set of training tuples, will construct a generalization (i.e., classification) model before receiving new (e.g., test) tuples to classify. We can think of the learned model as being ready and eager to classify previously unseen tuples.

Imagine a contrasting lazy approach, in which the learner instead waits until the last minute before doing any model construction to classify a given test tuple. That is, when given a training tuple, a lazy learner simply stores it (or does only a little minor processing) and waits until it is given a test tuple. Only when it sees the test tuple does it perform generalization to classify the tuple based on its similarity to the stored training tuples. Unlike eager learning methods, lazy learners do less work when a training tuple is presented and more work when making a classification or numeric prediction. Because lazy learners store the training tuples or “instances,” they are also referred to as instance-based learners, even though all learning is essentially based on instances.

When making a classification or numeric prediction, lazy learners can be computationally expensive. They require efficient storage techniques and are well suited to implementation on parallel hardware. They offer little explanation or insight into the data’s structure. Lazy learners, however, naturally support incremental learning. They are able to model complex decision spaces having hyperpolygonal shapes that may not be as easily describable by other learning algorithms (such as hyperrectangular shapes modeled by decision trees). In this section, we look at two examples of lazy learners: k-nearest-neighbor classifiers (Section 9.5.1) and case-based reasoning classifiers (Section 9.5.2).

9.5.1 k-Nearest-Neighbor Classifiers

The k-nearest-neighbor method was first described in the early 1950s. The method is labor intensive when given large training sets, and did not gain popularity until the 1960s when increased computing power became available. It has since been widely used in the area of pattern recognition.

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k “nearest neighbors” of the unknown tuple.

“Closeness” is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X_1 = (x_11, x_12, ..., x_1n) and X_2 = (x_21, x_22, ..., x_2n), is

$$dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}. \qquad (9.22)$$
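The neighbor search with the Euclidean distance of Eq. (9.22) can be sketched in plain Python. This is a brute-force illustration that compares the query against every stored tuple and assigns the most common class among the k closest ones; the function names and the tiny training set are hypothetical, and no normalization is applied.

```python
import math
from collections import Counter


def euclidean(x1, x2):
    """Euclidean distance between two numeric tuples, as in Eq. (9.22)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def knn_classify(training, query, k=3):
    """Assign the most common class among the k nearest training tuples.

    training -- list of (attribute_tuple, class_label) pairs
    """
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training tuples
training = [
    ((1.0, 1.0), "no"),
    ((1.0, 2.0), "no"),
    ((5.0, 5.0), "yes"),
    ((6.0, 5.0), "yes"),
    ((5.0, 6.0), "yes"),
]
print(knn_classify(training, (5.5, 5.5), k=3))
```

Each classification scans the full training set, which illustrates why lazy learners do most of their work at prediction time.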

424 Chapter 9 Classification: Advanced Methods

In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple X_1 and in tuple X_2, square this difference, and accumulate it. The square root is taken of the total accumulated distance count. Typically, we normalize the values of each attribute before using Eq. (9.22). This helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). Min-max normalization, for example, can be used to transform a value v of a numeric attribute A to v′ in the range [0, 1] by computing

$$v' = \frac{v - min_A}{max_A - min_A}, \qquad (9.23)$$

where min_A and max_A are the minimum and maximum values of attribute A. Chapter 3 describes other methods for data normalization as a form of data transformation.

For k-nearest-neighbor classification, the unknown tuple is assigned the most common class among its k-nearest neighbors. When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space. Nearest-neighbor classifiers can also be used for numeric prediction, that is, to return a real-valued prediction for a given unknown tuple. In this case, the classifier returns the average value of the real-valued labels associated with the k-nearest neighbors of the unknown tuple.

“But how can distance be computed for attributes that are not numeric, but nominal (or categorical) such as color?” The previous discussion assumes that the attributes used to describe the tuples are all numeric. For nominal attributes, a simple method is to compare the corresponding value of the attribute in tuple X_1 with that in tuple X_2. If the two are identical (e.g., tuples X_1 and X_2 both have the color blue), then the difference between the two is taken as 0. If the two are different (e.g., tuple X_1 is blue but tuple X_2 is red), then the difference is considered to be 1. Other methods may incorporate more sophisticated schemes for differential grading (e.g., where a larger difference score is assigned, say, for blue and white than for blue and black).

“What about missing values?” In general, if the value of a given attribute A is missing in tuple X_1 and/or in tuple X_2, we assume the maximum possible difference. Suppose that each of the attributes has been mapped to the range [0, 1]. For nominal attributes, we take the difference value to be 1 if either one or both of the corresponding values of A are missing. If A is numeric and missing from both tuples X_1 and X_2, then the difference is also taken to be 1. If only one value is missing and the other (which we will call v′) is present and normalized, then we can take the difference to be either |1 − v′| or |0 − v′| (i.e., 1 − v′ or v′), whichever is greater.

“How can I determine a good value for k, the number of neighbors?” This can be determined experimentally. Starting with k = 1, we use a test set to estimate the error rate of the classifier. This process can be repeated each time by incrementing k to allow for one more neighbor. The k value that gives the minimum error rate may be selected. In general, the larger the number of training tuples, the larger the value of k will be (so that classification and numeric prediction decisions can be based on a larger portion of the stored tuples). As the number of training tuples approaches infinity and k = 1, the

408 Chapter 9 Classification: Advanced Methods

with corresponding output unit values. Similarly, the sets of input values and activation values are studied to derive rules describing the relationship between the input layer and the hidden layer units. Finally, the two sets of rules may be combined to form IF-THEN rules. Other algorithms may derive rules of other forms, including M-of-N rules (where M out of a given N conditions in the rule antecedent must be true for the rule consequent to be applied), decision trees with M-of-N tests, fuzzy rules, and finite automata.

Sensitivity analysis is used to assess the impact that a given input variable has on a network output. The input to the variable is varied while the remaining input variables are fixed at some value. Meanwhile, changes in the network output are monitored. The knowledge gained from this form of analysis can be represented in rules such as “IF X decreases 5% THEN Y increases 8%.”

9.3 Support Vector Machines

In this section, we study support vector machines (SVMs), a method for the classification of both linear and nonlinear data. In a nutshell, an SVM is an algorithm that works as follows. It uses a nonlinear mapping to transform the original training data into a higher dimension. Within this new dimension, it searches for the linear optimal separating hyperplane (i.e., a “decision boundary” separating the tuples of one class from another). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. The SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors). We will delve more into these new concepts later.

“I’ve heard that SVMs have attracted a great deal of attention lately. Why?” The first paper on support vector machines was presented in 1992 by Vladimir Vapnik and colleagues Bernhard Boser and Isabelle Guyon, although the groundwork for SVMs has been around since the 1960s (including early work by Vapnik and Alexei Chervonenkis on statistical learning theory). Although the training time of even the fastest SVMs can be extremely slow, they are highly accurate, owing to their ability to model complex nonlinear decision boundaries. They are much less prone to overfitting than other methods. The support vectors found also provide a compact description of the learned model. SVMs can be used for numeric prediction as well as classification. They have been applied to a number of areas, including handwritten digit recognition, object recognition, and speaker identification, as well as benchmark time-series prediction tests.

9.3.1 The Case When the Data Are Linearly Separable

To explain the mystery of SVMs, let’s first look at the simplest case: a two-class problem where the classes are linearly separable. Let the data set D be given as (X_1, y_1), (X_2, y_2), ..., (X_|D|, y_|D|), where X_i is the set of training tuples with associated class labels, y_i. Each y_i can take one of two values, either +1 or −1 (i.e., y_i ∈ {+1, −1}),

9.3 Support Vector Machines 413

tuples were removed and training were repeated, the same separating hyperplane would be found. Furthermore, the number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality. An SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high.

9.3.2 The Case When the Data Are Linearly Inseparable

In Section 9.3.1 we learned about linear SVMs for classifying linearly separable data, but what if the data are not linearly separable, as in Figure 9.10? In such cases, no straight line can be found that would separate the classes. The linear SVMs we studied would not be able to find a feasible solution here. Now what?

The good news is that the approach described for linear SVMs can be extended to create nonlinear SVMs for the classification of linearly inseparable data (also called nonlinearly separable data, or nonlinear data for short). Such SVMs are capable of finding nonlinear decision boundaries (i.e., nonlinear hypersurfaces) in input space.

“So,” you may ask, “how can we extend the linear approach?” We obtain a nonlinear SVM by extending the approach for linear SVMs as follows. There are two main steps. In the first step, we transform the original input data into a higher dimensional space using a nonlinear mapping. Several common nonlinear mappings can be used in this step, as we will further describe next. Once the data have been transformed into the new higher space, the second step searches for a linear separating hyperplane in the new space. We again end up with a quadratic optimization problem that can be solved using the linear SVM formulation. The maximal marginal hyperplane found in the new space corresponds to a nonlinear separating hypersurface in the original space.

[Figure 9.10: scatter plot with axes A1 and A2. Legend: Class 1, y = +1 (buys_computer = yes); Class 2, y = −1 (buys_computer = no)]

Figure 9.10 A simple 2-D case showing linearly inseparable data. Unlike the linearly separable data of Figure 9.7, here it is not possible to draw a straight line to separate the classes. Instead, the decision boundary is nonlinear.
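The two-step idea (map to a higher dimensional space, then separate linearly) can be illustrated with a toy case. In the sketch below, four 1-D points cannot be split by any single threshold, but after the nonlinear mapping φ(x) = (x, x²) a linear function of the new coordinates separates them. The mapping, the data, and the hand-picked weights `w` and `b` are all illustrative assumptions. Nothing here is learned by an actual SVM optimizer.

```python
def phi(x):
    """Nonlinear mapping from the 1-D input space to a 2-D space."""
    return (x, x * x)


def decide(z, w=(0.0, 1.0), b=-2.5):
    """Linear decision function d(Z) = W . Z + b in the new space.

    w and b are hand-picked for this toy data, not optimized.
    """
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def separable_by_threshold(points):
    """Check whether any single threshold on x separates the two classes."""
    xs = sorted(x for x, _ in points)
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    for t in thresholds:
        for sign in (+1, -1):
            if all((sign if x > t else -sign) == y for x, y in points):
                return True
    return False


# Hypothetical 1-D data: the outer points are class +1, the inner points class -1
points = [(-2.0, +1), (-1.0, -1), (1.0, -1), (2.0, +1)]

print(separable_by_threshold(points))
print([(x, decide(phi(x))) for x, _ in points])
```

In the original 1-D space the boundary d(φ(x)) = x² − 2.5 = 0 is nonlinear (two cut points), even though it is a straight line in the mapped space.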

9.3 Support Vector Machines 409 414 Chapter 9 Classification: Advanced Methods 9.5 Lazy Learners (or Learning from Your Neighbors) 425

A2 Example 9.2 Nonlinear transformation of original input data into a higher dimensional space. error rate can be no worse than twice the Bayes error rate (the latter being the theoretical
Class 1, y = +1 (buys_computer = yes) Consider the following example. A 3-D input vector X = (x1 , x2 , x3 ) is mapped into minimum). If k also approaches infinity, the error rate approaches the Bayes error rate.
Class 2, y = −1 (buys_computer = no) a 6-D space, Z, using the mappings φ1 (X) = x1 , φ2 (X) = x2 , φ3 (X) = x3 , φ4 (X) = Nearest-neighbor classifiers use distance-based comparisons that intrinsically assign
(x1 )2 , φ5 (X) = x1 x2 , and φ6 (X) = x1 x3 . A decision hyperplane in the new space is equal weight to each attribute. They therefore can suffer from poor accuracy when given
d(Z) = WZ + b, where W and Z are vectors. This is linear. We solve for W and noisy or irrelevant attributes. The method, however, has been modified to incorporate
b and then substitute back so that the linear decision hyperplane in the new (Z) attribute weighting and the pruning of noisy data tuples. The choice of a distance metric
space corresponds to a nonlinear second-order polynomial in the original 3-D input can be critical. The Manhattan (city block) distance (Section 2.4.4), or other distance
space: measurements, may also be used.
Nearest-neighbor classifiers can be extremely slow when classifying test tuples. If D
d(Z) = w1 x1 + w2 x2 + w3 x3 + w4 (x1 )2 + w5 x1 x2 + w6 x1 x3 + b is a training database of |D| tuples and k = 1, then O(|D|) comparisons are required to
classify a given test tuple. By presorting and arranging the stored tuples into search trees,
= w1 z1 + w2 z2 + w3 z3 + w4 z4 + w5 z5 + w6 z6 + b.
the number of comparisons can be reduced to O(log(|D|)). Parallel implementation can
But there are some problems. First, how do we choose the nonlinear mapping to reduce the running time to a constant, that is, O(1), which is independent of |D|.
a higher dimensional space? Second, the computation involved will be costly. Refer to Other techniques to speed up classification time include the use of partial distance
Eq. (9.19) for the classification of a test tuple, X T . Given the test tuple, we have to com- calculations and editing the stored tuples. In the partial distance method, we compute
A1 pute its dot product with every one of the support vectors.3 In training, we have to the distance based on a subset of the n attributes. If this distance exceeds a threshold,
compute a similar dot product several times in order to find the MMH. This is espe- then further computation for the given stored tuple is halted, and the process moves on
cially expensive. Hence, the dot product computation required is very heavy and costly. to the next stored tuple. The editing method removes training tuples that prove useless.
Figure 9.7 The 2-D training data are linearly separable. There are an infinite number of possible We need another trick! This method is also referred to as pruning or condensing because it reduces the total
separating hyperplanes or “decision boundaries,” some of which are shown here as dashed Luckily, we can use another math trick. It so happens that in solving the quadratic number of tuples stored.
lines. Which one is best? optimization problem of the linear SVM (i.e., when searching for a linear SVM in the
new higher dimensional space), the training tuples appear only in the form of dot prod-
ucts, φ(X i ) · φ(X j ), where φ(X) is simply the nonlinear mapping function applied to 9.5.2 Case-Based Reasoning
corresponding to the classes buys computer = yes and buys computer = no, respectively. transform the training tuples. Instead of computing the dot product on the transformed
To aid in visualization, let’s consider an example based on two input attributes, A1 and data tuples, it turns out that it is mathematically equivalent to instead apply a kernel Case-based reasoning (CBR) classifiers use a database of problem solutions to solve
A2 , as shown in Figure 9.7. From the graph, we see that the 2-D data are linearly separa- function, K (X i , X j ), to the original input data. That is, new problems. Unlike nearest-neighbor classifiers, which store training tuples as points
ble (or “linear,” for short), because a straight line can be drawn to separate all the tuples in Euclidean space, CBR stores the tuples or “cases” for problem solving as complex
of class +1 from all the tuples of class −1. K (X i , Xj ) = φ(X i ) · φ(X j ). (9.20) symbolic descriptions. Business applications of CBR include problem resolution for
There are an infinite number of separating lines that could be drawn. We want to find customer service help desks, where cases describe product-related diagnostic problems.
the “best” one, that is, one that (we hope) will have the minimum classification error on CBR has also been applied to areas such as engineering and law, where cases are either
In other words, everywhere that φ(X i ) · φ(X j ) appears in the training algorithm, we can
previously unseen tuples. How can we find this best line? Note that if our data were 3-D technical designs or legal rulings, respectively. Medical education is another area for
replace it with K (X i , Xj ). In this way, all calculations are made in the original input space,
(i.e., with three attributes), we would want to find the best separating plane. Generalizing CBR, where patient case histories and treatments are used to help diagnose and treat
which is of potentially much lower dimensionality! We can safely avoid the mapping—it
to n dimensions, we want to find the best hyperplane. We will use “hyperplane” to refer to new patients.
turns out that we don’t even have to know what the mapping is! We will talk more later
the decision boundary that we are seeking, regardless of the number of input attributes. When given a new case to classify, a case-based reasoner will first check if an iden-
about what kinds of functions can be used as kernel functions for this problem.
So, in other words, how can we find the best hyperplane? tical training case exists. If one is found, then the accompanying solution to that case
After applying this trick, we can then proceed to find a maximal separating hyper-
An SVM approaches this problem by searching for the maximum marginal hyper- is returned. If no identical case is found, then the case-based reasoner will search for
plane. The procedure is similar to that described in Section 9.3.1, although it involves
plane. Consider Figure 9.8, which shows two possible separating hyperplanes and their training cases having components that are similar to those of the new case. Concep-
placing a user-specified upper bound, C, on the Lagrange multipliers, αi . This upper
associated margins. Before we get into the definition of margins, let’s take an intuitive tually, these training cases may be considered as neighbors of the new case. If cases
bound is best determined experimentally.
look at this figure. Both hyperplanes can correctly classify all the given data tuples. Intu- are represented as graphs, this involves searching for subgraphs that are similar to sub-
“What are some of the kernel functions that could be used?” Properties of the kinds of
itively, however, we expect the hyperplane with the larger margin to be more accurate graphs within the new case. The case-based reasoner tries to combine the solutions of
kernel functions that could be used to replace the dot product scenario just described
at classifying future data tuples than the hyperplane with the smaller margin. This is the neighboring training cases to propose a solution for the new case. If incompatibili-
why (during the learning or training phase) the SVM searches for the hyperplane with ties arise with the individual solutions, then backtracking to search for other solutions
the largest margin, that is, the maximum marginal hyperplane (MMH). The associated 3 The dot product of two vectors, X T = (x1T , x2T , . . . , xnT ) and X i = (xi1 , xi2 , . . . , xin ) is x1T xi1 + x2T xi2 may be necessary. The case-based reasoner may employ background knowledge and
margin gives the largest separation between classes. + · · · + xnT xin . Note that this involves one multiplication and one addition for each of the n dimensions. problem-solving strategies to propose a feasible combined solution.

410 Chapter 9 Classification: Advanced Methods 9.4 Classification Using Frequent Patterns 415

A2 A2 have been studied. Three admissible kernel functions are


Polynomial kernel of degree h: $K(X_i, X_j) = (X_i \cdot X_j + 1)^h$
Gaussian radial basis function kernel: $K(X_i, X_j) = e^{-\|X_i - X_j\|^2 / 2\sigma^2}$
Sigmoid kernel: $K(X_i, X_j) = \tanh(\kappa X_i \cdot X_j - \delta)$
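As a minimal sketch, the three admissible kernels listed above can be written directly from their formulas. The function names and the default values for h, sigma, kappa, and delta are illustrative assumptions, not values given in the text.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, h=2):
    """Polynomial kernel of degree h: (x . y + 1)^h."""
    return (dot(x, y) + 1) ** h

def gaussian_rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """Sigmoid kernel: tanh(kappa * (x . y) - delta)."""
    return math.tanh(kappa * dot(x, y) - delta)
```

Each function takes two tuples in the original input space, so no explicit mapping to the higher-dimensional space is ever computed.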
Each of these results in a different nonlinear classifier in (the original) input space.
Neural network aficionados will be interested to note that the resulting decision hyperplanes found for nonlinear SVMs are the same type as those found by other well-known

Clustering & Learning:


neural network classifiers. For instance, an SVM with a Gaussian radial basis func-
tion (RBF) gives the same decision hyperplane as a type of neural network known as
A1
a radial basis function network. An SVM with a sigmoid kernel is equivalent to a simple
A1
two-layer neural network known as a multilayer perceptron (with no hidden layers).

Basic Clustering Techniques


There are no golden rules for determining which admissible kernel will result in the
Class 1, y = +1 (buys_computer = yes) Class 1, y = +1 (buys_computer = yes)
most accurate SVM. In practice, the kernel chosen does not generally make a large
Class 2, y = −1 (buys_computer = no) Class 2, y = −1 (buys_computer = no)
difference in resulting accuracy. SVM training always finds a global solution, unlike
(a) (b) neural networks, such as backpropagation, where many local minima usually exist
(Section 9.2.3).
So far, we have described linear and nonlinear SVMs for binary (i.e., two-class) clas-
Figure 9.8 Here we see just two possible separating hyperplanes and their associated margins. Which
sification. SVM classifiers can be combined for the multiclass case. See Section 9.7.1 for
one is better? The one with the larger margin (b) should have greater generalization accuracy.
some strategies, such as training one classifier per class and the use of error-correcting
codes.
Getting to an informal definition of margin, we can say that the shortest distance A major research goal regarding SVMs is to improve the speed in training and testing
from a hyperplane to one side of its margin is equal to the shortest distance from the so that SVMs may become a more feasible option for very large data sets (e.g., millions
hyperplane to the other side of its margin, where the “sides” of the margin are parallel of support vectors). Other issues include determining the best kernel for a given data set
to the hyperplane. When dealing with the MMH, this distance is, in fact, the shortest and finding more efficient methods for the multiclass case.
distance from the MMH to the closest training tuple of either class.
A separating hyperplane can be written as
W · X + b = 0,
where W is a weight vector, namely, W = {w1 , w2 , . . . , wn }; n is the number of attributes;
(9.12) 9.4 Classification Using Frequent Patterns
Ref & Acknowledgements:
Frequent patterns show interesting relationships between attribute–value pairs that

1. Dr Anoop Kumar Patel, NIT Kurukshetra


and b is a scalar, often referred to as a bias. To aid in visualization, let’s consider two input occur frequently in a given data set. For example, we may find that the attribute–value
attributes, A1 and A2 , as in Figure 9.8(b). Training tuples are 2-D (e.g., X = (x1 , x2 )), pairs age = youth and credit = OK occur in 20% of data tuples describing AllElectronics
where x1 and x2 are the values of attributes A1 and A2 , respectively, for X. If we think of customers who buy a computer. We can think of each attribute–value pair as an item,
b as an additional weight, w0 , we can rewrite Eq. (9.12) as so the search for these frequent patterns is known as frequent pattern mining or frequent
itemset mining. In Chapters 6 and 7, we saw how association rules are derived from
2. PM Tan, M Steinbach, A. Karpatne, Vipin Kumar, “Introduction
to Data Mining”, Pearson Education, India, Latest Edition
w0 + w1 x1 + w2 x2 = 0. (9.13)
frequent patterns, where the associations are commonly used to analyze the purchas-
Thus, any point that lies above the separating hyperplane satisfies ing patterns of customers in a store. Such analysis is useful in many decision-making
processes such as product placement, catalog design, and cross-marketing.
w0 + w1 x1 + w2 x2 > 0. (9.14) In this section, we examine how frequent patterns can be used for classification. 1
Similarly, any point that lies below the separating hyperplane satisfies Section 9.4.1 explores associative classification, where association rules are generated
from frequent patterns and used for classification. The general idea is that we can search
w0 + w1 x1 + w2 x2 < 0. (9.15) for strong associations between frequent patterns (conjunctions of attribute–value

9.3 Support Vector Machines 411 422 Chapter 9 Classification: Advanced Methods

Learning using Clustering


The weights can be adjusted so that the hyperplanes defining the “sides” of the margin
can be written as Mine Select
Two-step
H1 : w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, (9.16)
Data set Frequent patterns Discriminative patterns
H2 : w0 + w1 x1 + w2 x2 ≤ −1 for yi = −1. (9.17) (a)
That is, any tuple that falls on or above H1 belongs to class +1, and any tuple that falls
Transform Search
on or below H2 belongs to class −1. Combining the two inequalities of Eqs. (9.16) and Direct
(9.17), we get
Data set Compact tree Discriminative patterns

• Supervised learning: discover patterns in the data that


yi (w0 + w1 x1 + w2 x2 ) ≥ 1, ∀i. (9.18) (b)

Any training tuples that fall on hyperplanes H1 or H2 (i.e., the “sides” defining the

relate data attributes with a target (class) attribute.


margin) satisfy Eq. (9.18) and are called support vectors. That is, they are equally close Figure 9.13 A framework for frequent pattern–based classification: (a) a two-step general approach
to the (separating) MMH. In Figure 9.9, the support vectors are shown encircled with versus (b) the direct approach of DDPMine.
a thicker border. Essentially, the support vectors are the most difficult tuples to classify

– These patterns are then utilized to predict the


and give the most information regarding classification.
From this, we can obtain a formula for the size of the maximal margin. The distance from the separating hyperplane to any point on H1 is $\frac{1}{\|W\|}$, where $\|W\|$ is the Euclidean norm of W, that is, $\sqrt{W \cdot W}$ (footnote 2). By definition, this is equal to the distance from any point on H2 to the separating hyperplane. Therefore, the maximal margin is $\frac{2}{\|W\|}$.

values of the target attribute in future data

To improve the efficiency of the general framework, consider condensing steps 1 and 2 into just one step. That is, rather than generating the complete set of frequent patterns, it's possible to mine only the highly discriminative ones. This more direct approach is referred to as direct discriminative pattern mining. The DDPMine algorithm follows this approach, as illustrated in Figure 9.13(b). It first transforms the training data into

instances.
a compact tree structure known as a frequent pattern tree, or FP-tree (Section 6.2.4),
which holds all of the attribute–value (itemset) association information. It then searches
A2
for discriminative patterns on the tree. The approach is direct in that it avoids generat-
Class 1, y = +1 (buys_computer = yes)
ing a large number of indiscriminative patterns. It incrementally reduces the problem

• Unsupervised learning: The data have no target


Class 2, y = −1 (buys_computer = no)
by eliminating training tuples, thereby progressively shrinking the FP-tree. This further
speeds up the mining process.

attribute.
By choosing to transform the original data to an FP-tree, DDPMine avoids gener-
ating redundant patterns because an FP-tree stores only the closed frequent patterns.
gin

By definition, any subpattern, β, of a closed pattern, α, is redundant with respect to


ar
m

– We want to explore the data to find some intrinsic


α (Section 6.1.2). DDPMine directly mines the discriminative patterns and integrates
ge
Lar

feature selection into the mining framework. The theoretical upper bound on infor-
mation gain is used to facilitate a branch-and-bound search, which prunes the search

structures in them.
space significantly. Experimental results show that DDPMine achieves orders of mag-
nitude speedup over the two-step approach without decline in classification accuracy.
DDPMine also outperforms state-of-the-art associative classification methods in terms
A1
of both accuracy and efficiency.

Figure 9.9 Support vectors. The SVM finds the maximum separating hyperplane, that is, the one with
maximum distance between the nearest training tuples. The support vectors are shown with
a thicker border.
9.5 Lazy Learners (or Learning from Your Neighbors)
2
The classification methods discussed so far in this book—decision tree induction,
2 If
√ q Bayesian classification, rule-based classification, classification by backpropagation,
$W = \{w_1, w_2, \ldots, w_n\}$, then $\sqrt{W \cdot W} = \sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}$.
support vector machines, and classification based on association rule mining—are all
What is clustering for? What is a natural grouping among these objects? Major Clustering Approaches (I)
• Partitioning/ Flat approach:
• Let us see some real-life examples
– Usually start with a random (partial) partitioning
• Example 1: groups people of similar sizes together to
make “small”, “medium” and “large” T-Shirts. – Refine it iteratively
– Tailor-made for each person: too expensive – Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square
– One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers
Clustering is subjective errors
• Hierarchical approach:
according to their similarities
– Bottom-up, agglomerative
– To do targeted marketing.
– (Top-down, divisive)
– Create a hierarchical decomposition of the set of data
3 8 (or objects) using some criterion
Simpson's Family School Employees Females Males

What is clustering for? What Do We Need For Clustering? * Alternate Approaches (II)
1. Proximity Measure, either • Density-based approach:
• Example 3: Given a collection of text documents, we – Similarity measure : large if are similar
want to organize them according to their content – Dissimilarity measure (or distance) measure d small if – Based on connectivity and density functions
similarities, are similar
– Typical methods: DBSCAN, OPTICS, DenClue
– To produce a topic hierarchy Large d, Small s Larger s, Small d
• In fact, clustering is one of the most utilized data • Model-based:
mining techniques. – A model is hypothesized for each of the clusters and
2. Criteria Function to
– It has a long history, and used in almost every field, tries to find the best fit of that model to each other
Evaluate a Clustering
e.g., medicine, psychology, botany, sociology,
– Typical methods: EM, SOM, COBWEB
biology, archeology, marketing, insurance, 3. Algorithm to Compute
libraries, etc. Clustering • Grid-based:
– In recent years, due to the rapid increase of online – based on a multiple-level granularity structure
documents, text clustering becomes important.
4

What is Clustering? What is Similarity? Two Common Clustering Algos


The quality or state of being similar; likeness; resemblance; as, a similarity of features.
It is also called unsupervised learning, sometimes called Webster's Dictionary • Partitional algorithms: Construct various partitions and then
classification by statisticians and sorting by evaluate them by some criterion (we will see an example called BIRCH)
psychologists and segmentation by people in marketing • Hierarchical algorithms: Create a hierarchical decomposition of
the set of objects using some criterion
• Organizing data into classes such that there is
• high intra-class similarity Partitional
Similarity is hard
to define, but… Hierarchical
• low inter-class similarity “We know it when
we see it”
• Finding the class labels and the number of classes
The real meaning
directly from the data (in contrast to classification). of similarity is a
philosophical
• So, it’s a method of data exploration – a way of looking question. We will
for patterns or structure in the data that are of interest take a more
pragmatic
• More informally, finding natural groupings among 5 approach.
10 15
objects.

Aspects of clustering Contd… K-means (Partitional) Clustering: Overview


Always
Clustering Evaluation * Quality is
1. Intra-cluster cohesion (compactness): • Choose a number of clusters k
• A clustering algorithm dependent

– Cohesion measures how near the data points


on both
• Initialize cluster centers 1,… k i.e. Partition objects
– Partitional clustering similarity
in a cluster are to the cluster centroid. measure into k nonempty subsets
– Hierarchical clustering and
– Sum of squared error (SSE) is a commonly clustering • For each data point, compute the cluster center it is
–… implement closest to (using some distance measure) and assign the
used measure. ation
• A distance (similarity, or dissimilarity) function 2. Inter-cluster separation (isolation):
data point to this cluster
• Clustering quality – Separation means that different cluster centroids should • Re-compute cluster centers (mean of data points in
cluster)
– Inter-clusters distance  maximized be far away from one another.
• In most applications, expert judgments are still the key
• Stop when there are no new re-assignments
– Intra-clusters distance ⇒ minimized
• The quality of a clustering result depends on the How Many Clusters? *
algorithm, the distance function, and the – Fix the number of clusters to k
application. 6
– Find the best clustering according to the criteria
function (no. of clusters may vary)
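The cohesion criterion named above, the sum of squared error (SSE), can be sketched as follows. This is an illustrative example that assumes numeric points stored as tuples and squared Euclidean distance to each cluster's centroid; the helper names are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def sse(clusters):
    """Sum of squared distances from every point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total

# Two candidate clusterings of the same three points.
tight = [[(0, 0), (2, 0)], [(10, 10)]]
loose = [[(0, 0)], [(2, 0), (10, 10)]]
print(sse(tight), sse(loose))
```

A lower SSE indicates more compact clusters, so under this criterion the first clustering would be preferred over the second.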

What is a natural grouping among these objects?


Similarity Hierarchical Clustering: Overview
• Similarity measurements are used to determine the degree • Use distance matrix as clustering criteria. This
of “similarity” between a pair of entities method does not require the number of clusters k as an
• Different types: input, but needs a termination condition
– Association coefficients: Based on common features
that exist (or do not exist) between a pair of entities Step 0 Step 1 Step 2 Step 3 Step 4
agglomerative
• Most common type of similarity measurement (AGNES)
a
– Distance measures: Measure of the degree of ab
dissimilarity between entities. Considers each entity as b
abcde
vector. c
cde
• Distance can be based on simple numerical d
de
resemblance or Correlation based. e
divisive
Step 4 Step 3 Step 2 Step 1 Step 0 (DIANA)
7
Other Distinctions Between Sets of Clusters Pedro (Portuguese/Spanish)
Petros (Greek), Peter (English), Piotr (Polish),
Peadar (Irish), Pierre (French), Peder (Danish),
• Exclusive versus non-exclusive Peka (Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr (Russian)
– In non-exclusive clusterings, points may belong to
multiple clusters.
• Can belong to multiple classes or could be ‘border’
points
– Fuzzy clustering (one type of non-exclusive)
• In fuzzy clustering, a point belongs to every cluster
with some weight between 0 and 1
• Weights must sum to 1
• Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the
data
23 28

Desirable Properties of a Clustering Algorithm Hierarchical Agglomerative Clustering


• start with every data point in a separate cluster
• Scalability (in terms of both time and space) • We keep merging the most similar pairs of
data points/clusters until we have one big
• Ability to deal with different data types cluster left
• This is called a bottom-up or agglomerative
• Minimal requirements for domain knowledge to Hierarchical Clustering: Details method
determine input parameters • This produces a binary tree or dendrogram
• The final cluster is the root and each data item
• Able to deal with noise and outliers is a leaf
• Insensitive to order of input records • The height of the bars indicate how close the
items are
• Incorporation of user-specified constraints • A clustering of the data objects is obtained by
cutting the dendrogram at the desired level,
• Interpretability and usability then each connected component forms a
19 24 cluster

Types of Clusters Summarizing Similarity Measurements: Dendrogram Hierarchical clustering can sometimes show
• In order to better appreciate and evaluate the examples given in the patterns that are meaningless or spurious
• Well-separated clusters early part of this talk, we will now introduce the dendrogram.
• Dendrogram: shows the hierarchical relationship between objects
• For example, in this clustering, the tight grouping of Australia,
Terminal Branch Root
The similarity between two objects in a Anguilla, St. Helena etc is meaningful, since all these countries are
• Prototype-based clusters Internal Node
Internal Branch
dendrogram is represented as the height of former UK colonies.
Leaf the lowest internal node they share.
• However the tight grouping of Niger and India is completely
• Contiguity-based clusters spurious, there is no connection between the two.

• Density-based clusters

• Described by an Objective Function


25 St. Helena &
South Georgia &
South Sandwich
Serbia &
Montenegro 30
AUSTRALIA Dependencies ANGUILLA Islands U.K. (Yugoslavia) FRANCE NIGER INDIA IRELAND BRAZIL

Types of Clusters: Well-Separated Living Species as a Hierarchy… We can look at the dendrogram to determine the “correct” number of
clusters. In this case, the two highly separated subtrees are highly
suggestive of two clusters. (Things are rarely this clear cut, unfortunately)
• Well-Separated Clusters:
– A cluster is a set of points such that any point in
a cluster is closer (or more similar) to every other
point in the cluster than to any point not in the
cluster.

3 well-separated clusters
26 31
(Bovine:0.69395, (Spider Monkey 0.390, (Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268,
Human:0.11927):0.08386):0.06124):0.15057):0.54939);

Types of Clusters: Prototype-Based A Demonstration of Hierarchical Clustering using String Edit Distance One potential use of a dendrogram is to detect outliers
Pedro (Portuguese)
• Prototype-based Petros (Greek), Peter (English), Piotr (Polish), Peadar

– A cluster is a set of objects such that an object in


(Irish), Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian Alternative), The single isolated branch is suggestive of a
Petr (Czech), Pyotr (Russian)
data point that is very different to all others
a cluster is closer (more similar) to the prototype Cristovao (Portuguese)
or “center” of a cluster, than to the center of any Christoph (German), Christophe (French), Cristobal
(Spanish), Cristoforo (Italian), Kristoffer
other cluster (Scandinavian), Krystof (Czech), Christopher (English)

– The center of a cluster is often a centroid, the Miguel (Portuguese)


average of all the points in the cluster, or a Michalis (Greek), Michael (English), Mick (Irish!)

medoid, the most “representative” point of a


cluster

Outlier
4 center-based clusters
27 32
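The demonstration above clusters names such as Peter and Piotr by string edit distance. A minimal sketch of that distance follows, assuming unit cost for each substitution, insertion, and deletion and using the standard dynamic-programming (Levenshtein) recurrence; the function name is illustrative.

```python
def edit_distance(q, c):
    """Cheapest number of unit-cost edits that turn string q into string c."""
    # prev[j] holds the distance between the processed prefix of q and c[:j].
    prev = list(range(len(c) + 1))
    for i, q_char in enumerate(q, start=1):
        curr = [i]  # turning q[:i] into the empty string takes i deletions
        for j, c_char in enumerate(c, start=1):
            substitution_cost = 0 if q_char == c_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete q_char
                curr[j - 1] + 1,                  # insert c_char
                prev[j - 1] + substitution_cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]

print(edit_distance("Peter", "Piotr"))
```

Filling a pairwise distance matrix with this function gives the input that a hierarchical clustering procedure needs.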
Bottom-Up (agglomerative): Starting with each
(How-to) Hierarchical Clustering item in its own cluster, find the best pair to merge into a
new cluster. Repeat until all clusters are fused together. How do we measure similarity?
The number of dendrograms with n Since we cannot test all possible trees Clustering is obtained by cutting the dendrogram at the
leaves = (2n − 3)! / [2^(n−2) (n − 2)!] we will have to do a heuristic search of all desired level: each connected component forms a cluster
possible trees. We could do this..
Number Number of Possible
of Leafs Dendrograms
2 1 Bottom-Up (agglomerative): Starting Consider all
3 3 Choose
with each item in its own cluster, find possible
4 15
… the best
5
...
105

the best pair to merge into a new merges…
Peter Piotr
10 34,459,425
cluster. Repeat until all clusters are
fused together.
Consider all
Choose
Top-Down (divisive): Starting with all possible
the best
the data in a single cluster, consider merges… …
every possible way to divide the cluster
into two. Choose the best division and
recursively operate on both sides. Consider all Choose
possible … the best
33 merges… 38 0.23 3 342.7 43
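The counts in the table of possible dendrograms can be reproduced with a short sketch. It assumes the formula reads (2n − 3)! / (2^(n−2) (n − 2)!), which is the form that matches the tabulated values; the function name is illustrative.

```python
from math import factorial

def num_dendrograms(n):
    """Number of distinct dendrograms (rooted binary trees) with n labeled leaves."""
    if n < 2:
        raise ValueError("need at least two leaves")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))
```

The rapid growth is why an exhaustive search over trees is impractical and a heuristic such as greedy merging is used instead.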

We begin with a distance Distance between Clusters X


X
Distance functions
matrix which contains the • Single link(nearest neighbor) : smallest distance between an
distances between every pair element in one cluster and an element in the other, i.e. dist(Ki,
of objects in our database. Kj) = min(tip, tjq) • Key to clustering. “similarity” and
– The minimum of all pairwise distances between points in
the two clusters
“dissimilarity” are also commonly used terms.
– Tends to produce long, “loose” clusters • There are numerous distance functions for
• Complete link (farthest neighbor): largest distance between
– Different types of data
0 8 8 7 7
an element in one cluster and an element in the other, i.e., • Numeric data
dist(Ki, Kj) = max(tip, tjq) • Nominal data
0 2 4 4
– The maximum of all pairwise distances between points in
the two clusters – Different specific applications
D( , ) = 8
0 3 3
– Tends to produce very tight clusters
0 1

D( , ) = 1 0
34 44

Bottom-Up (agglomerative):
Starting with each item in its own
Distance between Clusters X X

Distance functions for numeric attributes


cluster, find the best pair to merge into
a new cluster. Repeat until all clusters
are fused together. • Average: avg distance between an element in one cluster and an • Most commonly used functions are
element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
– Euclidean distance and
• Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj) – Manhattan (city block) distance
• Medoid: distance between the medoids of two clusters, i.e., • We denote distance with: dist(xi, xj), where xi
dist(Ki, Kj) = dist(Mi, Mj)
and xj are data points (vectors)
– Medoid: a chosen, centrally located object in the cluster
– What is difference between centroid and medoid? • They are special cases of Minkowski distance.
• Ward's Linkage: In this method, we try to minimize the variance of the merged clusters
h is a positive integer.
$dist(x_i, x_j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ir} - x_{jr}|^h \right)^{1/h}$
Consider all Choose
possible … the best
merges… 35 45
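The single, complete, average, and centroid definitions of distance between two clusters can be sketched as below. This is an illustrative example that assumes clusters are lists of numeric tuples and that Euclidean distance is the point-to-point measure; all function names are hypothetical.

```python
import math
from itertools import product

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pairwise(ki, kj):
    """All distances between an element of cluster ki and an element of kj."""
    return [euclidean(p, q) for p, q in product(ki, kj)]

def single_link(ki, kj):
    return min(pairwise(ki, kj))  # nearest neighbor

def complete_link(ki, kj):
    return max(pairwise(ki, kj))  # farthest neighbor

def average_link(ki, kj):
    d = pairwise(ki, kj)
    return sum(d) / len(d)

def centroid_link(ki, kj):
    def mean(points):
        return tuple(sum(v) / len(points) for v in zip(*points))
    return euclidean(mean(ki), mean(kj))

ki = [(0, 0), (1, 0)]
kj = [(3, 0), (5, 0)]
print(single_link(ki, kj), complete_link(ki, kj),
      average_link(ki, kj), centroid_link(ki, kj))
```

An agglomerative procedure would evaluate one of these functions for every pair of current clusters and merge the pair with the smallest value.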

Bottom-Up (agglomerative):
Starting with each item in its own Euclidean distance and Manhattan distance
cluster, find the best pair to merge into
a new cluster. Repeat until all clusters
are fused together. • If h = 2, it is the Euclidean distance
$dist(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ir} - x_{jr})^2}$
Single linkage
• If h = 1, it is the Manhattan distance
$dist(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ir} - x_{jr}|$
Consider all possible merges… Choose the best
• Weighted Euclidean distance
$dist(x_i, x_j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_r (x_{ir} - x_{jr})^2}$
Consider all possible merges… Choose the best
Average linkage Wards linkage

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Consider all possible merges… Choose the best

Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.
• Hierarchical nature maps nicely onto human intuition for some domains
• They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
• Like any heuristic search algorithm, local optima are a problem.
• Interpretation of results is (very) subjective.

Squared distance and Chebychev distance
• Squared Euclidean distance: to place progressively greater weight on data points that are further apart.
$dist(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ir} - x_{jr})^2$
• Chebychev distance: one wants to define two data points as "different" if they are different on any one of the attributes.
$dist(x_i, x_j) = \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \ldots, |x_{ir} - x_{jr}|)$
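The numeric distance functions above (Minkowski with h = 1 and h = 2, weighted Euclidean, squared Euclidean, and Chebychev) can be compared side by side in a small sketch. It assumes points are equal-length numeric tuples; the function names are illustrative.

```python
def minkowski(x, y, h):
    """Minkowski distance; h = 1 gives Manhattan, h = 2 gives Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def weighted_euclidean(x, y, w):
    """Euclidean distance with one non-negative weight per attribute."""
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y)) ** 0.5

def squared_euclidean(x, y):
    """Sum of squared attribute differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def chebychev(x, y):
    """Largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))

xi, xj = (0, 0), (3, 4)
print(minkowski(xi, xj, 1), minkowski(xi, xj, 2),
      squared_euclidean(xi, xj), chebychev(xi, xj))
```

Writing Manhattan and Euclidean as calls to the same function makes the "special case of Minkowski" relationship explicit.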
Distance functions for binary and nominal attributes
• Binary attribute: has two values or states but no ordering relationships, e.g.,
– Gender: male and female.
• We use a confusion matrix to introduce the distance functions/measures.
• Let the ith and jth data points be xi and xj (vectors)

Nominal attributes
• Nominal attributes: with more than two states or values.
– the commonly used distance measure is also based on the simple matching method.
– Given two data points xi and xj, let the number of attributes be r, and the number of values that match in xi and xj be q.
$dist(x_i, x_j) = \frac{r - q}{r}$

Partitional Clustering Details

Confusion matrix

                     Data Point xj
                       1     0
Data Point xi    1     a     b
                 0     c     d

a : the number of attributes with the value of 1 for both data points
b : the number of attributes for which xif=1 and xjf=0, where xif (xjf) is the value of the fth attribute of the data point xi (xj)
c : the number of attributes for which xif=0 and xjf=1
d : the number of attributes with the value of 0 for both data points

Distance function for text documents
• A text document consists of a sequence of sentences and each sentence consists of a sequence of words.
• To simplify: a document is usually considered a “bag” of words in document clustering.
– Sequence and position of words are ignored.
• A document is represented with a vector just like a normal data point.
• It is common to use similarity to compare two documents rather than distance.
– The most commonly used similarity function is the cosine similarity. We will study this later.

Partitional Clustering
• Nonhierarchical, each instance is placed in exactly one of K nonoverlapping clusters.
• Since only one set of clusters is output, the user normally has to input the desired number of clusters K.

Symmetric binary attributes
• A binary attribute is symmetric if both of its states (0 and 1) have equal importance, and carry the same weights, e.g., male and female of the attribute Gender
• Distance function: Simple Matching Coefficient, proportion of mismatches of their values
$dist(x_i, x_j) = \frac{b + c}{a + b + c + d}$

A generic technique for measuring similarity
To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.
The distance between Patty and Selma.
Change dress color, 1 point
Change earring shape, 1 point
Change hair part, 1 point
D(Patty,Selma) = 3
The distance between Marge and Selma.
Change dress color, 1 point
Add earrings, 1 point
Decrease height, 1 point
Take up smoking, 1 point
Lose weight, 1 point
D(Marge,Selma) = 5
This is called the “edit distance” or the “transformation distance”

K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space $X \subseteq R^r$, and r is the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
– Each cluster has a cluster center, called centroid.
– k is specified by the user

Symmetric binary attributes: example

K-means algorithm
Given k & dataset D, k-means algorithm works as:
1. Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
2. Repeat
3. for each point x do
4. Compute distance from x to each centroid
5. Assign each data point to the closest centroid
6. end for
7. Re-compute the centroids using the current cluster memberships.
8. Until the stopping criterion is met

Edit Distance Example
It is possible to transform any string Q into string C, using only Substitution, Insertion and Deletion.
Assume that each of these operators has a cost associated with it.
The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C.
Note that for now we have ignored the issue of how we can find this cheapest transformation.
How similar are the names “Peter” and “Piotr”?
Assume the following cost function
Substitution 1 Unit
Insertion 1 Unit
Deletion 1 Unit
D(Peter,Piotr) is 3
Peter
Substitution (i for e)
Piter
Insertion (o)
Pioter
Deletion (e)
Piotr
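The edit distance between "Peter" and "Piotr" can be checked with a short dynamic-programming sketch. The slides leave open how the cheapest transformation is found; the table-filling approach below (the standard Levenshtein recurrence) is one common way to do it, and the function name `edit_distance` and its cost parameters are illustrative choices, not something taken from the slides.

```python
def edit_distance(q, c, sub_cost=1, ins_cost=1, del_cost=1):
    """Cheapest cost of transforming string q into string c."""
    rows, cols = len(q) + 1, len(c) + 1
    # dp[i][j] = cheapest cost of turning q[:i] into c[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * del_cost          # delete every character of q[:i]
    for j in range(1, cols):
        dp[0][j] = j * ins_cost          # insert every character of c[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = q[i - 1] == c[j - 1]
            dp[i][j] = min(
                dp[i - 1][j - 1] + (0 if same else sub_cost),  # keep or substitute
                dp[i - 1][j] + del_cost,                       # delete from q
                dp[i][j - 1] + ins_cost,                       # insert into q
            )
    return dp[-1][-1]


print(edit_distance("Peter", "Piotr"))
```

With unit costs for all three operators, this reproduces the value D(Peter,Piotr) = 3 stated on the slide.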

Asymmetric binary attributes
• Asymmetric: if one of the states is more important or more valuable than the other.
– By convention, state 1 represents the more important state, which is typically the rare or infrequent state.
– Jaccard coefficient is a popular measure
$dist(x_i, x_j) = \frac{b + c}{a + b + c}$
– We can have some variations, adding weights

Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),
$SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2 \quad (1)$
– Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.
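The k-means loop and its stopping criterion can be put together in a minimal sketch. It assumes Euclidean distance, uses stopping criterion 1 (no re-assignments) plus an iteration cap, and keeps an old centroid when a cluster becomes empty; these choices, and all function names, are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_vector(points):
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def k_means(data, k, max_iter=50, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)              # random seeds as initial centroids
    assignment = [None] * len(data)
    for _ in range(max_iter):
        # assign each point to its closest centroid
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in data
        ]
        if new_assignment == assignment:         # no re-assignments: stop
            break
        assignment = new_assignment
        # re-compute centroids from the current memberships
        for j in range(k):
            members = [x for x, a in zip(data, assignment) if a == j]
            if members:                          # assumption: keep old centroid if empty
                centroids[j] = mean_vector(members)
    sse = sum(squared_distance(x, centroids[a]) for x, a in zip(data, assignment))
    return centroids, assignment, sse


data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, labels, sse = k_means(data, k=2)
print(labels, sse)
```

The returned `sse` is the quantity in equation (1), evaluated with squared Euclidean distance.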
K-means Clustering: Step 1 to Step 5
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figures: five scatter plots on axes from 0 to 5, one per step, showing the centroids k1, k2, k3]

An example
[Figure: example run of k-means; seeds marked with +]

An example (cont …)
[Figure]

An Example Distance Function
• Compute distance using Euclidean or any other suitable function
• Find mean of the cluster using
$m_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$
where $|C_j|$ is the number of data points in cluster Cj.
The distance from one data point xi to a mean (centroid) mj is computed with Euclidean Distance
$dist(x_i, m_j) = \sqrt{(x_{i1} - m_{j1})^2 + \dots + (x_{ir} - m_{jr})^2}$

A disk version of k-means
• K-means can be implemented with data on disk
– In each iteration, it scans the data once.
– as the centroids can be computed incrementally
• It can be used to cluster large datasets that do not fit in main memory
• We need to control the number of iterations
– In practice, a limit is set (< 50).
• Not the best method. There are other scale-up algorithms, e.g., BIRCH.

A disk version of k-means (cont …)
[Figure: algorithm listing with callouts "expression in condition 1" and "expression in condition 2"]

Strengths of k-means
• Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
– Since both k and t are small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Note that: it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.

Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
– For categorical data, k-mode - the centroid is represented by most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
– Outliers are data points that are very far away from other data points.
– Outliers could be errors in the data recording or some special data points with very different values.

Weaknesses of k-means: Problems with outliers
[Figure]

Weaknesses of k-means: To deal with outliers
• One method is to remove some data points in the clustering process that are much further away from the centroids than other data points.
– To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them.
• Another method is to perform random sampling. Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small.
– Assign the rest of the data points to the clusters by distance or similarity comparison, or classification

Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.
[Figure]

Weaknesses of k-means (cont …)
• If we use different seeds: good results
[Figure]
There are some methods to help choose good seeds

Bisecting K-means: Example
[Figure]

K-means summary
• Despite weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency, and because
– other clustering algorithms have their own lists of weaknesses.
• No clear evidence that any other clustering algorithm performs better in general
– although they may be more suitable for some specific types of data or applications.
• Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!

Some more weaknesses of k-means
• The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
[Figure]

Solutions to Initial Centroids Problem
• Multiple runs
– Helps, but probability is not on your side
• Use some strategy to select the k initial centroids and then select among these initial centroids
– Select most widely separated
• K-means++ is a robust way of doing this selection
– Use hierarchical clustering to determine initial centroids
• Bisecting K-means
– Not as susceptible to initialization issues

CSC380: Principles of Data Science
Class Notes | Clustering - K Means

Table Of Contents
Supervised versus Unsupervised Learning
  What is the Key Difference Between Unsupervised Learning versus Supervised Learning?
  How can we view the task of discrete binning/grouping in supervised and unsupervised learning setups?
Clustering
  What is a Cluster?
  What is Clustering?
  What are the Different Types of Clustering?
K Means
  What is the Basic Idea behind K-means?
  What is the 1-of-K Coding Scheme?
  What is the Objective Function in K-Means Algorithm?
  What is Distortion Measure?
  What is the K-Means algorithm? ( More Formalised Version)
Convergence in K Means
  What is Convergence in K-Means?
  Is Convergence Guaranteed in K-Means?
Minima Issues in K-Means
  What are some reasons why a cluster may have just One Point?
The Number Of Topics for K Means
  How can we choose a number of Topics for K-Means?
  What is the KMeans ++ Algorithm?
K Medoids
Assumptions made by K Means
Implementation of K Means

Some more weaknesses of K-means: Differing Sizes
[Figure: Original Points vs. K-means (3 Clusters)]

K-means++
• This approach can be slower than random initialization, but very consistently produces better results in terms of SSE
– The k-means++ algorithm guarantees an approximation ratio O(log k) in expectation, where k is the number of centers
• To select a set of initial centroids, C, perform the following
1. Select an initial point at random to be the first centroid
2. For k – 1 steps
2a. For each of the N points, xi, 1 ≤ i ≤ N, find the minimum squared distance to the currently selected centroids, C1, …, Cj, 1 ≤ j < k, i.e., $\min_j d^2(C_j, x_i)$
2b. Randomly select a new centroid by choosing a point with probability proportional to $\frac{\min_j d^2(C_j, x_i)}{\sum_i \min_j d^2(C_j, x_i)}$
End For
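The K-means++ seeding steps can be sketched as follows. This is a minimal illustration assuming squared Euclidean distance and at least k distinct points; the function name `kmeans_pp_seeds` and the fixed random seed are hypothetical conveniences. `random.choices` normalizes the weights itself, so the division by the sum in step 2b is implicit.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_pp_seeds(points, k, seed=0):
    rng = random.Random(seed)
    centroids = [rng.choice(points)]             # step 1: first centroid at random
    while len(centroids) < k:                    # step 2: k - 1 further picks
        # step 2a: minimum squared distance to the centroids chosen so far
        d2 = [min(squared_distance(x, c) for c in centroids) for x in points]
        # step 2b: sample a point with probability proportional to d2
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
seeds = kmeans_pp_seeds(points, k=3)
print(seeds)
```

Points already chosen have weight zero, so they are not picked again; the returned seeds can then be used as the initial centroids of the standard k-means loop.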

Bisecting K-means
• Bisecting K-means algorithm
– Variant of K-means that can produce a partitional or a hierarchical clustering
• Basic Concept: Split the set of all points into 2 clusters; select one of these clusters to split and so on, until K clusters have been produced.
• Which Cluster to Split? Many ways: choose the largest cluster at each step, the one with the largest SSE, or a criterion based on both size and SSE.
• As bisection is locally optimal, it's better to refine the resulting clusters by using their centroid as the initial centroid for the standard K-means algo

Bisecting K-means: Algo
1. Initialize the list of clusters to contain the cluster consisting of all points
2. Repeat
3. Remove a cluster from the list of clusters based on criterion
4. for i=1 to number of trials do //do several "trial" bisections
5. bisect the selected cluster using K-means
6. end for
7. Select the two clusters from the bisection with the lowest total SSE
8. Add these two clusters to the list of clusters
9. Until the list of clusters contains K clusters

Some more weaknesses of K-means: Differing Density
[Figure: Original Points vs. K-means (3 Clusters)]

Weaknesses of K-means: Non-globular Shapes
[Figure: Original Points vs. K-means (2 Clusters)]

Supervised versus Unsupervised Learning

What is the Key Difference Between Unsupervised Learning versus Supervised Learning?
Supervised Learning: Learning under supervision. We have a full set of labeled data for learning.
Unsupervised Learning: Algorithms analyze and cluster unlabeled data sets, discovering hidden patterns in data without the need for human intervention. The data is unlabeled.
Read more at Nvidia Blog, IBM Blog

How can we view the task of discrete binning/grouping in supervised and unsupervised learning setups?
Supervised Learning: Classification
Unsupervised Learning: Clustering

Clustering

What is a Cluster?
A Cluster can be thought of as comprising a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster.

What is Clustering?
The grouping of objects such that objects in the same cluster are more similar to each other than they are to objects in another cluster.[1]
Or
The task of finding an assignment of data points to clusters, as well as a set of vectors {μk}, such that the sum of the squares of the distances of each data point to its closest vector μk ( or another similarity measure) is a minimum.
Read more at (1) Nvidia Blog Google Developers Blog

What are the Different Types of Clustering?
Clean and simple explanation with diagrams at Google Developers Machine Learning Course - Clustering Algorithms

K Means
Based on Chapter 9: Pattern Recognition and Machine Learning- Christopher M Bishop

What is the Basic Idea behind K-means?
● Assign (Random) Cluster Centroids
● Until Convergence:
○ Cluster Assignment Step
○ Re-assigning Centroid Step
A simple gif visualization at Introduction to K-Means Clustering in Python with scikit-learn
Video Recommendation: Andrew NG - Machine Learning Course - This Lecture

What is the 1-of-K Coding Scheme?
For each data point xn, we introduce a corresponding set of binary indicator variables rnk ∈ { 0, 1}, where k = 1, . . . , K describing which of the K clusters the data point xn is assigned to, so that if data point xn is assigned to cluster k then rnk = 1, and rnj = 0 for j ≠ k.

What is the Objective Function in K-Means Algorithm?
$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert^2$
Also called the Distortion Measure, this represents the sum of the squares of the distances of each data point to its assigned vector μk. N is the number of data points, and K is the number of clusters (topics/classes).
Our goal is to find values for the {rnk} and the {μk} so as to minimize J.


Artificial Neural
Networks

References and Acknowledgements:


Dr Sudeshna Sarkar, IIT Kharagpur
Dr Kuldeep, Dept of Computer Engg, NIT Kurukshetra
SN Sivanandam & SN Deepa, book titled “Principles of Soft Computing”,
Wiley India.

Learning in Neural Nets

Learning Tasks

Supervised
Data: Labeled examples (input, desired output)
Tasks: classification, pattern recognition, Regression
Decision Tree, kNN, SVM
NN models: perceptron, feed-forward NN, Deep Learning

Unsupervised
Data: Unlabeled examples (different realizations of the input)
Tasks: clustering, content addressable memory
NN models: self-organizing maps (SOM), Hopfield networks

Elements of Neural Network

Neuron
$z = x_1 w_1 + x_2 w_2 + \dots + x_K w_K + w_0$, output $h(z)$
[Figure: inputs x1 … xK with weights w1 … wK and bias w0 feed a summation node producing z, followed by the activation function h(z)]

Single Perceptron
$z = x_1 w_1 + x_2 w_2 + \dots + x_K w_K + w_0$
$h(z) = 1$ if $\sum_i w_i x_i > 0$, and $0$ otherwise
[Figure: inputs x1 … xK with weights w1 … wK and bias w0 feed a summation node producing z, followed by the threshold h(z)]
•The quantity -w0 is a threshold that the weighted combination of inputs must surpass in order for the perceptron to output 1.
•To simplify notation, we imagine an additional constant input x0 = 1, so that $z = \sum_{i=0}^{K} w_i x_i$

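Representational Power of Perceptrons
[Figure: two scatter plots of + and - points in the (x1, x2) plane. Left: the classes can be separated by a line (representable by a perceptron). Right: the classes cannot be separated by a line (NOT representable by a perceptron).]

Perceptron is a Linear Threshold Unit (LTU).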

Perceptron Training Rule


$w_i \leftarrow w_i + \Delta w_i$
$\Delta w_i = \eta (t - o) x_i$, where $x_i$ is the ith input of the sample
t is the target value
o is the perceptron output
η is a small constant (e.g. 0.1) called learning rate

• If the output is correct (t = o) the weights wi are not changed
• If the output is incorrect (t ≠ o) the weights wi are changed such that the output of the perceptron for the new weights is closer to t.
• The algorithm converges to the correct classification
• if the training data is linearly separable
• and η is sufficiently small
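The perceptron training rule can be tried on the AND function with a short sketch. It assumes a constant bias input x0 = 1, zero initial weights, and an update after every sample; these are illustrative choices, and the function names are hypothetical.

```python
def predict(weights, x):
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1 if z > 0 else 0


def train_perceptron(samples, eta=0.1, max_epochs=100):
    weights = [0.0] * len(samples[0][0])         # assumption: start from zero weights
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            o = predict(weights, x)
            if o != t:
                errors += 1
                # w_i <- w_i + eta * (t - o) * x_i
                weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]
        if errors == 0:                          # every sample classified correctly
            break
    return weights


# Each sample is ((x0, x1, x2), target) with the constant bias input x0 = 1.
and_samples = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
weights = train_perceptron(and_samples)
predictions = [predict(weights, x) for x, _ in and_samples]
print(weights, predictions)
```

Because AND is linearly separable, the loop stops once an epoch passes with no misclassified sample.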

What is Distortion Measure?

Training Perceptrons
•What are the weight values?
•Initialize with random weight values
[Figure: perceptron with inputs X0, X1, X2, weights W0 = ?, W1 = ?, W2 = ?, and th = 0.0]

For AND
X1 X2 y
0 0 0
0 1 0
1 0 0
1 1 1

Training Perceptrons
For AND
X1 X2 y
0 0 0
0 1 0
1 0 0
1 1 1
[Figure: perceptron with inputs X0, X1, X2, weights W0 = -0.3, W1 = 0.5, W2 = -0.4, and th = 0.0]

X0 X1 X2 Summation Output
1 0 0 (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3 0
1 0 1 (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7 0
1 1 0 (-1*0.3) + (1*0.5) + (0*-0.4) = 0.2 1
1 1 1 (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2 0

Weight Updation
• W0= -0.3 + [(0-0)1+(0-0)1+(0-1)1+(1-0)1]= -0.3
• W1= 0.5 + [(0-0)0+(0-0)0+(0-1)1+(1-0)1]= 0.5
• W2= -0.4 + [(0-0)0+(0-0)1+(0-1)0+(1-0)1]= 0.6

Same as above question


X0 X1 X2 Summation Output
1 0 0 (-1*0.3) + (0*0.5) + (0*0.6) = -0.3 0
1 0 1 (-1*0.3) + (0*0.5) + (1*0.6) = 0.3 1
1 1 0 (-1*0.3) + (1*0.5) + (0*0.6) = 0.2 1
1 1 1 (-1*0.3) + (1*0.5) + (1*0.6) = 0.8 1

Weight Updation
• W0= -0.3 + [(0-0)1+(0-1)1+(0-1)1+(1-1)1]= -2.3
• W1= 0.5 + [(0-0)0+(0-1)0+(0-1)1+(1-1)1]= -0.5
• W2= 0.6 + [(0-0)0+(0-1)1+(0-1)0+(1-1)1]= -0.4

X0 X1 X2 Summation Output
1 0 0 (-1*2.3) + (-0*0.5) + (-0*0.4) = -2.3 0
1 0 1 (-1*2.3) + (-0*0.5) + (-1*0.4) = -2.7 0
1 1 0 (-1*2.3) + (-1*0.5) + (-0*0.4) = -2.8 0
1 1 1 (-1*2.3) + (-1*0.5) + (-1*0.4) = -3.2 0


Weight Updation
• W0= -2.3 + [(0-0)1+(0-0)1+(0-0)1+(1-0)1]= -1.3
• W1= -0.5 + [(0-0)0+(0-0)0+(0-0)1+(1-0)1]= 0.5
• W2= -0.4 + [(0-0)0+(0-0)1+(0-0)0+(1-0)1]= 0.6

X0 X1 X2 Summation Output
1 0 0 (-1*1.3) + (0*0.5) + (0*0.6) = -1.3 0
1 0 1 (-1*1.3) + (0*0.5) + (1*0.6) = -0.7 0
1 1 0 (-1*1.3) + (1*0.5) + (0*0.6) = -0.8 0
1 1 1 (-1*1.3) + (1*0.5) + (1*0.6) = -0.2 0


Weight Updation
• W0= -1.3 + [(0-0)1+(0-0)1+(0-0)1+(1-0)1]= -0.3
• W1= 0.5 + [(0-0)0+(0-0)0+(0-0)1+(1-0)1]= 1.5
• W2= 0.6 + [(0-0)0+(0-0)1+(0-0)0+(1-0)1]= 1.6

X0 X1 X2 Summation Output
1 0 0 (-1*0.3) + (0*1.5) + (0*1.6) = -0.3 0
1 0 1 (-1*0.3) + (0*1.5) + (1*1.6) = 1.3 1
1 1 0 (-1*0.3) + (1*1.5) + (0*1.6) = 1.2 1
1 1 1 (-1*0.3) + (1*1.5) + (1*1.6) = 2.8 1

Weight Updation
• W0= -0.3 + [(0-0)1+(0-1)1+(0-1)1+(1-1)1]= -2.3
• W1= 1.5 + [(0-0)0+(0-1)0+(0-1)1+(1-1)1]= 0.5
• W2= 1.6 + [(0-0)0+(0-1)1+(0-1)0+(1-1)1]= 0.6

What is the K-Means algorithm? ( More Formalised Version)

X0 X1 X2 Summation Output
1 0 0 (-1*2.3) + (0*0.5) + (0*0.6) = -2.3 0
1 0 1 (-1*2.3) + (0*0.5) + (1*0.6) = -1.7 0
1 1 0 (-1*2.3) + (1*0.5) + (0*0.6) = -1.8 0
1 1 1 (-1*2.3) + (1*0.5) + (1*0.6) = -1.2 0


Weight Updation
• W0= -2.3 + [(0-0)1+(0-0)1+(0-0)1+(1-0)1]= -1.3
• W1= 0.5 + [(0-0)0+(0-0)0+(0-0)1+(1-0)1]= 1.5
• W2= 0.6 + [(0-0)0+(0-0)1+(0-0)0+(1-0)1]= 1.6

X0 X1 X2 Summation Output
1 0 0 (-1*1.3) + (0*1.5) + (0*1.6) = -1.3 0
1 0 1 (-1*1.3) + (0*1.5) + (1*1.6) = 0.3 1
1 1 0 (-1*1.3) + (1*1.5) + (0*1.6) = 0.2 1
1 1 1 (-1*1.3) + (1*1.5) + (1*1.6) = 1.8 1

1. Choose the number of clusters k



Weight Updation
• W0= -1.3 + [(0-0)1+(0-1)1+(0-1)1+(1-1)1]= -3.3
• W1= 1.5 + [(0-0)0+(0-1)0+(0-1)1+(1-1)1]= 0.5
• W2= 1.6 + [(0-0)0+(0-1)1+(0-1)0+(1-1)1]= 0.6

X0 X1 X2 Summation Output
1 0 0 (-1*3.3) + (0*0.5) + (0*0.6) = -3.3 0
1 0 1 (-1*3.3) + (0*0.5) + (1*0.6) = -2.7 0
1 1 0 (-1*3.3) + (1*0.5) + (0*0.6) = -2.8 0
1 1 1 (-1*3.3) + (1*0.5) + (1*0.6) = -2.2 0


2. Select k random points from the data as centroids ( We later discuss how to optimise this)
Weight Updation
• W0= -3.3 + [(0-0)1+(0-0)1+(0-0)1+(1-0)1]= -2.3
• W1= 0.5 + [(0-0)0+(0-0)0+(0-0)1+(1-0)1]= 1.5
• W2= 0.6 + [(0-0)0+(0-0)1+(0-0)0+(1-0)1]= 1.6

X0 X1 X2 Summation Output
1 0 0 (-1*2.3) + (0*1.5) + (0*1.6) = -2.3 0
1 0 1 (-1*2.3) + (0*1.5) + (1*1.6) = -0.7 0
1 1 0 (-1*2.3) + (1*1.5) + (0*1.6) = -0.8 0
1 1 1 (-1*2.3) + (1*1.5) + (1*1.6) = 0.8 1


3. Until Convergence
a. Assignment to a cluster Head/Centroid.

Limitations of Perceptrons
• Perceptrons have a monotonicity property: If a link has positive weight, activation can only increase as the corresponding input value increases (irrespective of other input values)
• Can’t represent functions where input interactions can cancel one another’s effect (e.g. XOR)
• Can represent only linearly separable functions

Gradient Descent and the Delta Rule
• The perceptron rule finds a successful weight vector when the training examples are linearly separable, but it can fail to converge if the examples are not linearly separable.
• The delta rule overcomes this difficulty.
• If the training examples are not linearly separable, the delta rule converges toward a best-fit approximation to the target concept.
• The key idea behind the delta rule is to use gradient descent to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples.

Gradient Descent
• Consider linear unit without threshold and continuous output o (not just –1,1)
$o = w_0 + w_1 x_1 + \dots + w_n x_n$
• Train the wi’s such that they minimize the squared error
$E[w_0, \dots, w_n] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
where D is the set of training examples, td is the target output for training example d, and od is the output of the linear unit for training example d.

Gradient Descent
• The w0, w1 plane represents the entire hypothesis space.
• The vertical axis indicates the error E relative to some fixed set of training examples.
• The error surface summarizes the desirability of every weight vector in the hypothesis space (we desire a hypothesis with minimum error).
• The arrow shows the negated gradient at one particular point, indicating the direction in the w0, w1 plane producing steepest descent along the error surface.

Gradient descent search determines a weight vector that minimizes E by starting with an arbitrary initial weight vector, then repeatedly modifying it in small steps
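Gradient descent on the squared error of a linear unit can be sketched in a few lines. The data set (points on the line t = 1 + 2x), the learning rate, and the step count are assumptions chosen so that the loop visibly converges; the function name is hypothetical.

```python
def train_linear_unit(examples, eta=0.02, steps=2000):
    n_inputs = len(examples[0][0])
    w = [0.0] * (n_inputs + 1)                   # w[0] is the bias weight w0
    for _ in range(steps):
        # accumulates sum over d of (t_d - o_d) * x_id, i.e. minus dE/dw_i
        direction = [0.0] * len(w)
        for x, t in examples:
            xb = (1.0,) + tuple(x)               # constant input x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xb))
            for i, xi in enumerate(xb):
                direction[i] += (t - o) * xi
        # small step along the negative gradient
        w = [wi + eta * di for wi, di in zip(w, direction)]
    return w


examples = [((x,), 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
w = train_linear_unit(examples)
print(w)
```

The weights are updated once per pass over all training examples, so each step uses the gradient of the full error E. For this noise-free data the weights approach w0 = 1 and w1 = 2.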

Comparison Perceptron and Gradient Descent Rule

Perceptron learning rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η

Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H

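A solution: multiple layers
[Figure: network with an input layer (x1, x2), a hidden layer (z1, z2) and an output layer (y), shown next to the corresponding regions in the (x1, x2) and (z1, z2) planes]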

Power/Expressiveness of Multilayer Networks
• Can represent interactions among inputs
• Two layer networks can represent any Boolean function, and continuous functions (within a tolerance) as long as the number of hidden units is sufficient and appropriate activation functions used
• Learning algorithms exist, but weaker guarantees than perceptron learning algorithms
• Multilayer networks are capable of expressing a rich variety of nonlinear decision surfaces
• Nonlinearity gets introduced over the linear equation $\sum_i w_i x_i$ by use of activation function after each layer (hidden as well as output)
• For nonlinear classification, activation fn should be non-linear

Multi-layer feed-forward networks
Multi-layer, feed forward networks extend perceptrons i.e., 1-layer networks into n-layer by:
•Partition units into layers 0 to L:
– the lowermost layer, numbered 0, contains the input units
– the topmost layer, numbered L, contains the output units
– layers numbered 1 to L-1 are the hidden layers
•Connectivity means bottom-up connections only, with no cycles, hence the name "feed-forward" nets
•Input layers transmit input values to hidden layer nodes hence do not perform any computation.

Note: layer number indicates the distance of a node from the input nodes

b. Resetting the Centroid.
Find the mean of all points in a cluster, and set that as the new centroid of the cluster. ( Hence the name k-means)

Multilayer feed forward network
[Figure: inputs enter the input layer, pass through the first hidden layer and the second hidden layer, and leave the output layer as outputs]


Multi-layer feed-forward networks

• Multi-layer feed-forward networks can be trained by back-propagation provided the activation function g is a differentiable function.
– Threshold units don’t qualify, but the sigmoid function does.

• Back-propagation learning is a gradient descent search through the parameter space to minimize the sum-of-squares error.
– Most common learning algorithm for multilayer networks

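Sigmoid Unit
$net = \sum_{i=0}^{n} w_i x_i$, with $x_0 = 1$
$o = \sigma(net) = \frac{1}{1 + e^{-net}}$
[Figure: inputs x1 … xn with weights w1 … wn and bias input x0 = 1 with weight w0 feed a summation node; its value net passes through σ to give the output o]
σ(x) is the sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$

$\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$, so σ is a differentiable function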

g: Some Activation functions for units

Transforms neuron’s input into output.

Step function (Linear Threshold Unit)
step(x) = 1, if x >= threshold
          0, if x < threshold

Sign function
sign(x) = +1, if x >= 0
          -1, if x < 0

Sigmoid function
sigmoid(x) = 1/(1+e^(-x))

Rectified Linear Unit (ReLU)
relu(x) = max(0, x)

Activation Functions
Introduction to Neural Networks

• Non-linear activations are needed to learn complex (non-linear) data representations
– Otherwise, NNs would be just a linear function (such as $W_1 W_2 x = W x$)
– NNs with large number of layers (and neurons) can approximate more complex functions
o Figure: more neurons improve representation (but, may overfit)

Picture from: http://cs231n.github.io/assets/nn1/layer_sizes.jpeg

Activation: Tanh

• Tanh function: takes a real-valued number and “squashes” it into range between -1 and 1
– Like sigmoid, tanh neurons saturate
– Unlike sigmoid, the output is zero-centered
o It is therefore preferred to sigmoid
– Tanh is a scaled sigmoid: $\tanh(x) = 2\sigma(2x) - 1$

Slide credit: Ismini Lourentzou – Introduction to Deep Learning

Activation: ReLU

• ReLU (Rectified Linear Unit): takes a real-valued number and thresholds it at zero
– Most modern deep NNs use ReLU activations
– ReLU is fast to compute
o Compared to sigmoid, tanh
o Simply threshold a matrix at zero
– Accelerates the convergence of gradient descent
o Due to linear, non-saturating form
– Helps mitigate the vanishing gradient problem

Activation: Leaky ReLU

• The problem of ReLU activations: they can “die”
– ReLU could cause weights to update in a way that the gradients can become zero and the neuron will not activate again on any data
– E.g., when a large learning rate is used

• Leaky ReLU activation function is a variant of ReLU
– Instead of the function being 0 when x < 0, a leaky ReLU has a small negative slope (e.g., α = 0.01, or similar)
– This resolves the dying ReLU problem
– Most current works still use ReLU
o With a proper setting of the learning rate, the problem of dying ReLU can be avoided
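The activation functions discussed on these slides can be written out directly. This is a small illustration in plain Python on scalars (real networks apply the same functions element-wise to arrays); the default slope `alpha=0.01` follows the value quoted for leaky ReLU.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # sigma'(x) = sigma(x) * (1 - sigma(x))


def tanh(x):
    return math.tanh(x)                  # zero-centered, range (-1, 1)


def relu(x):
    return max(0.0, x)                   # thresholds the input at zero


def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x     # small slope alpha for negative inputs


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```

The identity tanh(x) = 2·sigmoid(2x) − 1 can be checked numerically with these definitions.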

Backpropagation learning algorithm ‘BP’


Solution to credit assignment problem in MLP. Rumelhart, Hinton
and Williams (1986) (though actually invented earlier in a PhD
thesis relating to economics)

BP has two phases:


•Forward pass phase: computes ‘functional signal’, feed forward
propagation of input pattern signals through network
•Backward pass phase: computes ‘error signal’, propagates the
error backwards through network starting at output units (where
the error is the difference between actual and desired output
values)

Convergence in K Means

The idea behind backpropagation
• We don’t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.
– Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities.
– Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined.
– We can compute error derivatives for all the hidden units efficiently.
– Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit.

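Two-layer back-propagation neural network
[Figure: input layer with inputs x1 … xn (index i), hidden layer (index j) connected by weights w_ij, and output layer with outputs y1 … y_n2 (index k) connected by weights w_jk. Input signals flow forward from the input layer to the output layer; error signals flow backward.]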

Derivation of Gradient Descent Rule

• How can we calculate the direction of steepest descent along the error surface?

• This direction can be found by computing the derivative of E with respect to each component of the vector w0,..,wn.

$E[w_0, \dots, w_n] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

• This vector derivative is called the gradient of E with respect to w0, w1,..,wn, written $\nabla E(w_0, \dots, w_n)$

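What is Convergence in K-Means?
There is no further change in the assignments of the centroid of clusters, or cluster assignment of points in the algorithm.

Derivation of Gradient Descent Rule
(for multiple Weights: partial derivative )
• Gradient:
$\nabla E[w_0, \dots, w_n] = \left[ \frac{\partial E}{\partial w_0}, \dots, \frac{\partial E}{\partial w_n} \right]$

• When the gradient is interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E.

• The negative of this vector ( $-\nabla E[w_0, \dots, w_n]$ ) gives the direction of steepest decrease.

Training Rule for Gradient Descent(Multiple Weights)
$\vec{w} \leftarrow \vec{w} + \Delta \vec{w}$, where $\Delta \vec{w} = -\eta \nabla E[\vec{w}]$

• This training rule can also be written in its component form
$w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$

Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, do
– For each training example, do
• Input the training example to the network and compute the network outputs
• For each output unit k
$\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
• For each hidden unit h
$\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} \delta_k$
• Update each network weight $w_{i,j}$
$w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta \delta_j x_{i,j}$
Notation: $x_d$ = input, $t_d$ = target output, $o_d$ = observed unit output, $w_{i,j}$ = wt from i to j

Derivation
• For one output neuron, the error function is
$E = \frac{1}{2}(t - o)^2$
• But, for each intermediate unit j, $net_j$ is a summation, as one o/p contributes to all neurons of next layer with resp weights; the output is defined as
$o_j = \sigma(net_j) = \sigma\left( \sum_i w_{ij} o_i \right)$ ($o_i$ is output from prev layer)
[The input to an activation fn. is the weighted sum of outputs of previous neurons]
• Finding the derivative of the error:
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}}$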


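Derivation (Contd)

• For output layer, each factor of the chain rule is:
$\frac{\partial net_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \left( \sum_k w_{kj} o_k \right) = o_i$
• $\frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = \sigma(net_j)(1 - \sigma(net_j)) = o_j (1 - o_j)$
$\frac{\partial E}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2}(t_j - o_j)^2 = -(t_j - o_j)$

Derivation (For intermediate layers)

(same as already) $\frac{\partial net_j}{\partial w_{ij}} = o_i$
• $\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j)$
• For intermediate layer: E = ?? there is no observed output, hence it will be summation of errors for each output to which this neuron contributes
• So for intermediate layer:
• $\frac{\partial E}{\partial o_j} = \sum_{k \in outputs} \frac{\partial E}{\partial net_k} \frac{\partial net_k}{\partial o_j} = -\sum_{k \in outputs} \delta_k w_{jk}$

Overall Weight updation(output and intermediate layers)

For Output layer neurons:
$\frac{\partial E}{\partial w_{ij}} = -(t_j - o_j)\, o_j (1 - o_j)\, o_i = -\delta_j o_i$
Here the derivative of error (called $\delta_j$) is $\delta_j = -\frac{\partial E}{\partial net_j} = (t_j - o_j)\, o_j (1 - o_j)$
For intermediate layers/unit j: Only $\frac{\partial E}{\partial o_j}$ will change
• Recall:
$\frac{\partial E}{\partial o_j} = -\sum_{k} \delta_k w_{jk}$ ($w_{jk}$ are weights at jth layer)

To update weight $w_{ij}$ using gradient descent, recall the formula:
$w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$, with $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} = \eta \delta_j o_i$;
Hence $\delta_j$ for the jth layer will be $\delta_j = o_j (1 - o_j) \sum_k w_{jk} \delta_k$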

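Backpropagation Algorithm (Equations Again)

Initialize all weights to small random numbers.
Until satisfied, do
– For each training example, do
• Input the training example to the network and compute the network outputs
• For each output unit k
$\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
• For each hidden unit h
$\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} \delta_k$
• Update each network weight $w_{i,j}$
$w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta \delta_j x_{i,j}$
Notation: $x_d$ = input, $t_d$ = target output, $o_d$ = observed unit output, $w_{i,j}$ = wt from i to j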
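One backpropagation update for a single training example can be sketched as below. The sketch assumes sigmoid units throughout, one hidden layer, a single output unit, and a bias input of 1 at each layer; the network size, the random initial weights, and all function names are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    xb = [1.0] + list(x)                                  # bias input x0 = 1
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, xb))) for ws in w_hidden]
    hb = [1.0] + hidden                                   # bias for the output unit
    o = sigmoid(sum(w * hi for w, hi in zip(w_out, hb)))
    return xb, hb, o


def deltas(x, t, w_hidden, w_out):
    xb, hb, o = forward(x, w_hidden, w_out)
    delta_o = o * (1.0 - o) * (t - o)                     # output unit error term
    # hidden unit h: o_h (1 - o_h) * w_h,out * delta_o (only one output unit here)
    delta_h = [hb[h + 1] * (1.0 - hb[h + 1]) * w_out[h + 1] * delta_o
               for h in range(len(w_hidden))]
    return xb, hb, delta_o, delta_h


def backprop_step(x, t, w_hidden, w_out, eta=0.5):
    xb, hb, delta_o, delta_h = deltas(x, t, w_hidden, w_out)
    new_out = [w + eta * delta_o * hi for w, hi in zip(w_out, hb)]
    new_hidden = [[w + eta * dh * xi for w, xi in zip(ws, xb)]
                  for ws, dh in zip(w_hidden, delta_h)]
    return new_hidden, new_out


def squared_error(x, t, w_hidden, w_out):
    return 0.5 * (t - forward(x, w_hidden, w_out)[2]) ** 2


rng = random.Random(0)
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(3)]
x, t = (1.0, 0.0), 1.0
before = squared_error(x, t, w_hidden, w_out)
w_hidden2, w_out2 = backprop_step(x, t, w_hidden, w_out)
after = squared_error(x, t, w_hidden2, w_out2)
print(before, after)
```

Each weight moves by eta times the error term of the unit it feeds times the input it carries, which is the negative gradient of the squared error for that example.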

Problems with Gradient Descent
Training Neural Networks

• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points

[Figure: cost curve annotated "Very slow at the plateau", "Stuck at a saddle point", "Stuck at a local minimum"]

Slide credit: Hung-yi Lee – Deep Learning Tutorial

Is Convergence Guaranteed in K-Means?

Gradient Descent with Momentum

• Gradient descent with momentum uses the momentum of the gradient for parameter optimization

Movement = Negative of Gradient + Momentum

[Figure: cost curve with arrows for "Negative of Gradient", "Momentum" and "Real Movement"; at a point where Gradient = 0 the momentum term still moves the parameters]

Slide credit: Hung-yi Lee – Deep Learning Tutorial
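The rule "Movement = Negative of Gradient + Momentum" can be sketched on a one-dimensional cost. The quadratic cost w², the momentum coefficient 0.9, and the function name are assumptions for illustration.

```python
def gd_momentum(grad, w0, eta=0.1, momentum=0.9, steps=500):
    w, velocity = w0, 0.0
    for _ in range(steps):
        # movement = momentum * previous movement + negative of the gradient step
        velocity = momentum * velocity - eta * grad(w)
        w = w + velocity
    return w


# Assumed cost: cost(w) = w**2, whose gradient is 2 * w and whose minimum is at 0.
w_final = gd_momentum(lambda w: 2.0 * w, w0=5.0)
print(w_final)
```

Because the velocity carries over from step to step, the parameter keeps moving even where the current gradient is zero or very small.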

Backpropagation
• Gradient descent over entire network weight vector
• Can be generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
• May include weight momentum

• Training may be slow.


• Using network after training is very fast

Yes, in each phase our Objective Function will decrease and will reach a steady state.

Learning Rate

• Learning rate
– The gradient tells us the direction in which the loss has the steepest rate of increase, but it does not tell us how far along the opposite direction we should step
– Choosing the learning rate (also called the step size) is one of the most important hyper-parameter settings for NN training

[Figure: two loss curves, "LR too small" and "LR too large"]

Learning Rate

• Training loss for different learning rates
– High learning rate: the loss increases or plateaus too quickly
– Low learning rate: the loss decreases too slowly (takes many epochs to reach a solution)

Picture from: https://cs231n.github.io/neural-networks-3/

Learning Rate Scheduling


Training Neural Networks

• Learning rate scheduling is applied to change the values of the learning rate during
the training
 Annealing is reducing the learning rate over time (a.k.a. learning rate decay)
o Approach 1: reduce the learning rate by some factor every few epochs
– Typical values: reduce the learning rate by a half every 5 epochs, or divide by 10 every 20 epochs
o Approach 2: exponential or cosine decay gradually reduce the learning rate over time
o Approach 3: reduce the learning rate by a constant (e.g., by half) whenever the validation loss
stops improving
– In TensorFlow: tf.keras.callbacks.ReduceLROnPleateau()
» Monitor: validation loss, factor: 0.1 (i.e., divide by 10), patience: 10 (how many epochs to wait before applying it), Minimum
learning rate: 1e-6 (when to stop)

 Warmup is gradually increasing the learning rate initially, and afterward let it cool down
until the end of the training
Figure: example learning-rate curves for exponential decay, cosine decay, and warmup
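The schedules listed on this slide can be written as small functions of the epoch number. This is a minimal sketch in plain Python; the function names, the decay constant k, and the linear warmup shape are assumptions made for illustration and are not taken from the slides or from any library.

```python
import math

def step_decay(lr0, epoch, factor=0.5, every=5):
    """Approach 1: multiply the rate by `factor` every `every` epochs."""
    return lr0 * factor ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.1):
    """Approach 2 (exponential): lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)

def cosine_decay(lr0, epoch, total_epochs):
    """Approach 2 (cosine): half-cosine from lr0 at epoch 0 down to 0 at total_epochs."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * epoch / total_epochs))

def warmup_then_cosine(lr0, epoch, total_epochs, warmup_epochs=5):
    """Linear warmup up to lr0, then cosine decay over the remaining epochs."""
    if epoch < warmup_epochs:
        return lr0 * (epoch + 1) / warmup_epochs
    return cosine_decay(lr0, epoch - warmup_epochs, total_epochs - warmup_epochs)

for epoch in (0, 5, 10, 25, 49):
    print(epoch,
          round(step_decay(0.1, epoch), 5),
          round(exponential_decay(0.1, epoch), 5),
          round(cosine_decay(0.1, epoch, 50), 5),
          round(warmup_then_cosine(0.1, epoch, 50), 5))
```

The default arguments of step_decay correspond to the "reduce by a half every 5 epochs" setting quoted above.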

Training practices: batch vs. stochastic vs. mini-batch gradient descent

• Batch gradient descent:
1. Calculate outputs for the entire dataset
2. Accumulate the errors, back-propagate and update
 Too slow to converge
 Gets stuck in local minima
• Stochastic/online gradient descent:
1. Feed forward a training example
2. Back-propagate the error and update the parameters
 Converges to the solution faster
 Often helps get the system out of local minima
• Mini-batch gradient descent:
• Can you recall any concept studied in ML, which has analogy to this mini-batch?
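The three training practices differ only in how many examples contribute to each parameter update. The following sketch, which fits a hypothetical one-variable linear model (not a neural network) to noiseless toy data, exposes that choice as a single batch_size argument; all names and constants are illustrative assumptions.

```python
import random

def minibatch_gd(xs, ys, batch_size, lr=0.1, epochs=500, seed=0):
    """Fit y ~ w*x + b by gradient descent on the mean squared error.

    batch_size == len(xs) gives batch gradient descent,
    batch_size == 1 gives stochastic (online) gradient descent,
    anything in between is mini-batch gradient descent.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the gradient over the examples in this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [i / 10 for i in range(20)]   # toy inputs in [0, 1.9]
ys = [3.0 * x + 1.0 for x in xs]   # noiseless targets generated with w=3, b=1
for bs in (len(xs), 1, 4):
    w, b = minibatch_gd(xs, ys, batch_size=bs)
    print(f"batch_size={bs}: w={w:.4f}, b={b:.4f}")
```

For the same number of epochs, a smaller batch size means more updates, which is one reason the stochastic and mini-batch variants tend to make faster progress per pass over the data.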

Batch Normalization
• Batch normalization layers act similar to the data preprocessing
steps mentioned earlier
 They calculate the mean μ and variance σ² of a batch of input
data, and normalize the data x to a zero mean and unit variance
 I.e., $\hat{x} = \frac{x - \mu}{\sigma}$
• BatchNorm layers reduce the sensitivity of training to the initialization of
the parameters and to the choice of hyper-parameters
 Result in faster convergence during training and allow larger learning rates
 Reduce the internal covariate shift
• BatchNorm layers are inserted immediately after convolutional
layers or fully-connected layers, and before activation layers
 They are very common with convolutional NNs
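The normalization step can be sketched in plain Python as below. The learnable scale gamma, the shift beta, and the small constant eps are standard parts of batch normalization but are not mentioned on the slide, so they are assumptions here; the sketch covers only the training-time computation on one batch, not the running statistics used at inference.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) of `batch` to zero mean and unit variance.

    `batch` is a list of rows, one row per example. gamma and beta rescale and
    shift the normalized values; eps guards against division by zero.
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mu = sum(column) / n
        var = sum((v - mu) ** 2 for v in column) / n  # population variance of the batch
        for i in range(n):
            x_hat = (batch[i][j] - mu) / math.sqrt(var + eps)
            out[i][j] = gamma * x_hat + beta
    return out

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
for row in batch_norm(batch):
    print([round(v, 3) for v in row])
```

Both columns end up on the same scale even though the second feature is ten times larger than the first.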

Recall: k-Fold Cross-Validation


• Using k-fold cross-validation for hyper-parameter tuning is
common when the size of the training data is small
 It also leads to a better and less noisy estimate of the model
performance by averaging the results across several folds
• E.g., 5-fold cross-validation (see the figure on the next slide)
1. Split the training data into 5 equal folds
2. First use folds 2-5 for training and fold 1 for validation
3. Repeat by using fold 2 for validation, then fold 3, fold 4, and
fold 5
4. Average the results over the 5 runs (for reporting purposes)
5. Once the best hyper-parameters are determined, evaluate the
model on the test data
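Steps 1 to 4 above can be sketched as an index generator plus an averaging helper. The names k_fold_indices, cross_validate, and score_fn are hypothetical; score_fn stands for any user-supplied routine that trains on the training indices and returns a validation score. The sketch does not shuffle, which a real run would normally do first.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each fold serves as the validation set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_validate(score_fn, n_samples, k=5):
    """Average a user-supplied score over the k validation folds."""
    scores = [score_fn(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

for train, val in k_fold_indices(10, k=5):
    print("train:", train, "validation:", val)
```

In hyper-parameter tuning, cross_validate would be called once per candidate setting and the setting with the best average score kept.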

Recall: k-Fold Cross-Validation


• Illustration of a 5-fold cross-validation

Picture from: https://scikit-learn.org/stable/modules/cross_validation.html

Learning in epochs
• Train the NN on the entire training set over and over again
• Each such episode of training is called an “epoch”

Stopping
1. Fixed maximum number of epochs: most naïve
2. Keep track of the training and validation error curves, and stop when the validation error stops improving.
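One common way to act on the validation curve is early stopping with a patience counter. The slides do not spell out a rule, so the sketch below is an assumption: it takes a recorded list of per-epoch validation losses and returns the epoch whose weights would be kept.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch seen before training would be stopped.

    Training is treated as stopped once the validation loss has failed to
    improve for `patience` consecutive epochs.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # the validation error has stopped improving
    return best_epoch

val_curve = [1.0, 0.8, 0.6, 0.55, 0.57, 0.6, 0.65, 0.5]
print(early_stopping_epoch(val_curve, patience=3))
```

A small patience stops quickly but may miss a later improvement, as the last value of this toy curve illustrates; a larger patience trades extra epochs for that safety.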

Overfitting in ANNs

CSC380: Principles of Data Science 3


Local Minima

• Small networks can get stuck in local minima.


• For most large networks (many weights), local minima rarely occur.
• It is unlikely that you are at a minimum in every dimension
simultaneously.

ANN: Conclusions

• Highly expressive non-linear functions


• Highly parallel network of logistic function units
• Minimizes sum of squared training errors
• Local minima
• Overfitting

Thank You

• EXTRA EDITED EQUATIONS


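For output layer neurons:
$$\frac{\partial E}{\partial o_j} = o_j - y_j$$

Here the derivative of the error with respect to $net_j$ (called $\delta_j$) is
$$\delta_j = \frac{\partial E}{\partial net_j} = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}$$

For intermediate layers/unit $j$: only $\frac{\partial E}{\partial o_j}$ will change.

Hence
$$\frac{\partial E}{\partial o_j} = \sum_{l} (\delta_l\, w_{jl}), \qquad \frac{\partial net_j}{\partial w_{ij}} = o_i$$
where the sum runs over the units $l$ that receive input from unit $j$.

So for each unit $j$, the output $o_j$ is defined as $o_j = \varphi(net_j)$, and for an output neuron
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ij}} = (o_j - y_j) \cdot \varphi(net_j)\,(1 - \varphi(net_j)) \cdot o_i$$

To update weight $w_{ij}$ using gradient descent, recall the formula:
$$w_{ij} \leftarrow w_{ij} + \Delta w_{ij} \quad \text{with} \quad \Delta w_{ij} = -\eta \cdot \delta_j \cdot o_i$$

Hence $\delta_j$ will be
$$\delta_j = \frac{\partial E}{\partial net_j} = \begin{cases} (o_j - y_j)\, o_j\, (1 - o_j) & \text{if } j \text{ is an output neuron} \\ \left(\sum_{l} \delta_l\, w_{jl}\right) o_j\, (1 - o_j) & \text{if } j \text{ is an inner neuron} \end{cases}$$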
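The two cases of the delta rule can be checked numerically on a tiny network. The sketch below assumes logistic units, the squared error E = ½ Σ (o − y)², one hidden layer, and no bias terms; the function names and the weight layout (W[i][j] is the weight from unit i to unit j) are illustrative choices, not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """W1[i][j]: weight from input i to hidden j; W2[j][k]: weight from hidden j to output k."""
    hidden = [sigmoid(sum(x[i] * W1[i][j] for i in range(len(x))))
              for j in range(len(W1[0]))]
    output = [sigmoid(sum(hidden[j] * W2[j][k] for j in range(len(hidden))))
              for k in range(len(W2[0]))]
    return hidden, output

def loss(x, y, W1, W2):
    """Squared error E = 1/2 * sum_k (o_k - y_k)**2."""
    _, output = forward(x, W1, W2)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, y))

def backprop_step(x, y, W1, W2, eta=0.5):
    """One gradient-descent update of both weight matrices."""
    hidden, output = forward(x, W1, W2)
    # Output neurons: delta_k = (o_k - y_k) * o_k * (1 - o_k)
    delta_out = [(o - t) * o * (1 - o) for o, t in zip(output, y)]
    # Inner neurons: delta_j = (sum_k delta_k * w_jk) * o_j * (1 - o_j)
    delta_hid = [sum(delta_out[k] * W2[j][k] for k in range(len(output)))
                 * hidden[j] * (1 - hidden[j]) for j in range(len(hidden))]
    # Weight update: w_ij <- w_ij - eta * delta_j * o_i
    new_W2 = [[W2[j][k] - eta * delta_out[k] * hidden[j] for k in range(len(output))]
              for j in range(len(hidden))]
    new_W1 = [[W1[i][j] - eta * delta_hid[j] * x[i] for j in range(len(hidden))]
              for i in range(len(x))]
    return new_W1, new_W2

x, y = [0.5, -0.2], [1.0]
W1 = [[0.1, -0.3], [0.4, 0.2]]
W2 = [[0.7], [-0.5]]
new_W1, new_W2 = backprop_step(x, y, W1, W2)
print("loss before:", loss(x, y, W1, W2))
print("loss after: ", loss(x, y, new_W1, new_W2))
```

Comparing the analytic gradient with a finite-difference estimate of the loss is a useful sanity check whenever the delta formulas are implemented by hand.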

Source: Fig 9.2 in Bishop - Pattern Recognition And Machine Learning

Minima Issues in K-Means

K-means may converge to a local minimum rather than the global minimum.

Fig Source: Andrew Ng, Coursera Machine Learning Course


What are some reasons why a cluster may have just one point?

1. The point is an outlier.

2. A local minimum was attained.

The Number of Clusters for K-Means

How can we choose the number of clusters for K-Means?

1. The most common approach is to visualise the data and choose.

2. The Elbow Method

Over a range of k, we compute the distortion score (or a similar metric). When these are plotted as a
line chart, we will observe a point of inflection, an elbow.

The k at the elbow is the recommended number of clusters.
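The elbow can be seen numerically on toy data. The sketch below computes the distortion as the sum of squared distances to the nearest centroid for k = 1 to 5. Two things are assumptions made only to keep it short and deterministic: the data set (three small grids of points) and the evenly spaced initial centroids, which stand in for the random or k-means++ initialization a real run would use.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_distortion(points, k, iters=20):
    """Run Lloyd's k-means; return the sum of squared distances to the nearest centroid."""
    # Assumption: evenly spaced initial centroids, used only for reproducibility.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Toy data: three 3x3 grids of points centred at (0, 0), (10, 0) and (5, 8).
offsets = (-0.5, 0.0, 0.5)
points = [(cx + dx, cy + dy)
          for cx, cy in ((0.0, 0.0), (10.0, 0.0), (5.0, 8.0))
          for dx in offsets for dy in offsets]

distortions = {k: kmeans_distortion(points, k) for k in range(1, 6)}
for k, d in distortions.items():
    print(f"k={k}: distortion={d:.2f}")
```

On this data the distortion falls steeply up to k = 3 and only marginally afterwards, which is the elbow pattern described above.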

Source of Image

But sometimes, we may not observe a strict inflection point, like in the fig below.


Image Source: Andrew Ng, Machine Learning Course

3. KMeans++ Algorithm

What is the KMeans ++ Algorithm?

KMeans++ replaces the first step of K-Means, the random initialization of the centroids, with a smarter
initialization that improves the quality of the clustering.

From GeeksForGeeks:

1. Randomly select the first centroid from the data points.


2. For each data point compute its distance from the nearest, previously chosen
centroid.
3. Select the next centroid from the data points such that the probability of
choosing a point as centroid is directly proportional to its distance from the
nearest, previously chosen centroid. (i.e. the point having maximum distance
from the nearest centroid is most likely to be selected next as a centroid)
4. Repeat steps 2 and 3 until k centroids have been sampled
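The four steps can be sketched with the standard library alone. One deliberate difference is flagged here: the usual k-means++ formulation weights each point by its squared distance to the nearest chosen centroid, whereas the steps quoted above say distance; the sketch uses the squared distance. The function name kmeans_pp_init and the seed argument are illustrative assumptions.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centroids from `points` following the k-means++ scheme."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]  # step 1: first centroid uniformly at random
    while len(centroids) < k:         # step 4: repeat until k centroids are chosen
        # Step 2: squared distance from each point to its nearest chosen centroid.
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        # Step 3: sample the next centroid with probability proportional to that weight.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (5.0, 9.0), (5.1, 9.0)]
print(kmeans_pp_init(pts, 3, seed=1))
```

Because an already chosen point has weight zero, it cannot be drawn again, and far-away points are favoured without the choice becoming fully deterministic.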

Video Recommendation: Sara Jensen: K Means++

K Medoids

K-means may not be the most suitable algorithm in some cases, since it is very sensitive to noise and
outliers. K-means attempts to minimize the total squared error, while k-medoids minimizes
the sum of dissimilarities between the points labeled to be in a cluster and a point designated as the
center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers
(medoids or exemplars).[1]

So an improved variant of K-means, K-medoids, is used.

K-medoids algorithm from GeeksForGeeks

1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point to the closest medoid by using any common distance metric methods.
3. While the cost decreases:
For each medoid m, for each data point o which is not a medoid:
1. Swap m and o, associate each data point to the closest medoid, recompute the cost.
2. If the total cost is more than that in the previous step, undo the swap.
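The swap procedure above can be sketched as follows. The helper names, the Manhattan distance, and the toy points are assumptions for illustration. A swap is kept only when it strictly lowers the total cost, so swaps that leave the cost unchanged are undone as well, which guarantees that the loop terminates.

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids, dist):
    """Sum of dissimilarities between each point and its closest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, dist=manhattan, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)            # step 1: k random data points
    cost = total_cost(points, medoids, dist)   # step 2: assign points, compute the cost
    improved = True
    while improved:                            # step 3: loop while the cost decreases
        improved = False
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]  # swap medoid i with o
                new_cost = total_cost(points, candidate, dist)
                if new_cost < cost:            # keep the swap only if it lowers the cost
                    medoids, cost = candidate, new_cost
                    improved = True
    return medoids, cost

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(k_medoids(points, 2))
```

Since only the dist function touches the data, any dissimilarity measure can be plugged in, which is one practical advantage of k-medoids over k-means.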

Reference 1. K-means and K-medoids: applet

Assumptions made by K Means


Simple and straightforward explanation with figures found at mbmlbook.

Summary:
1. All clusters are the same size.
2. Clusters have the same extent in every direction.
3. Clusters have similar numbers of points assigned to them.
Find a demonstration at Demonstration of k-means assumptions — scikit-learn 1.0.1
documentation

Implementation of K Means

sklearn.cluster.KMeans — scikit-learn 1.0.1 documentation

Sample Implementation on the IRIS Dataset

From Scratch Implementation - K Means Clustering | K Means Clustering Algorithm in Python
