
Lecture 9: Classification, LDA

Reading: Chapter 4

STATS 202: Data mining and analysis

Jonathan Taylor, 10/12


Slide credits: Sergio Bacallado

1 / 21
Review: Main strategy in Chapter 4

Find an estimate P̂(Y | X). Then, given an input x0, we predict the
response as in a Bayes classifier:

ŷ0 = argmax_y P̂(Y = y | X = x0).

2 / 21
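
A minimal sketch of this step in Python (the array posterior is a hypothetical vector of estimated probabilities P̂(Y = y | X = x0), one entry per class):

import numpy as np

# Hypothetical estimated posteriors P-hat(Y = y | X = x0) for three classes.
posterior = np.array([0.2, 0.5, 0.3])

# The Bayes classifier predicts the class with the largest estimated posterior.
y_hat = int(np.argmax(posterior))   # here y_hat == 1
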
Linear Discriminant Analysis (LDA)

Instead of estimating P(Y | X), we will estimate:

1. P̂(X | Y): Given the response, what is the distribution of the inputs?

2. P̂(Y): How likely is each of the categories?

Then, we use Bayes rule to obtain the estimate:

P̂(Y = k | X = x) = P̂(X = x | Y = k) P̂(Y = k) / Σ_j P̂(X = x | Y = j) P̂(Y = j)

3 / 21
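
A small numerical sketch of this Bayes-rule step in Python (the densities and priors below are made-up numbers, not estimates from any dataset):

import numpy as np

# Hypothetical values of P-hat(X = x | Y = k) at one input x, and priors P-hat(Y = k).
f_x = np.array([0.05, 0.20, 0.01])   # class-conditional densities f_k(x)
prior = np.array([0.3, 0.5, 0.2])    # pi_k, one per class

# Bayes rule: posterior proportional to f_k(x) * pi_k, normalized over all classes j.
unnorm = f_x * prior
posterior = unnorm / unnorm.sum()    # P-hat(Y = k | X = x)
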
Linear Discriminant Analysis (LDA)
Instead of estimating P(Y | X), we will estimate:

1. We model P̂(X = x | Y = k) = f̂_k(x) as a multivariate normal distribution:

[Figure: two panels plotting the inputs X1 and X2, illustrating multivariate normal class densities.]

2. P̂(Y = k) = π̂_k is estimated by the fraction of training samples of class k.

4 / 21
LDA has linear decision boundaries

Suppose that:

- We know P(Y = k) = π_k exactly.
- P(X = x | Y = k) is multivariate normal with density:

  f_k(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp(−(1/2)(x − µ_k)^T Σ^{−1} (x − µ_k))

  µ_k: Mean of the inputs for category k.
  Σ: Covariance matrix (common to all categories).

Then, what is the Bayes classifier?

5 / 21
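
A direct transcription of this density into Python (a sketch; in practice one would use scipy.stats.multivariate_normal, but the explicit version shows each piece of the formula):

import numpy as np

def mvn_density(x, mu, Sigma):
    """Multivariate normal density f_k(x) with mean mu and covariance Sigma."""
    p = len(mu)
    diff = x - mu
    Sigma_inv = np.linalg.inv(Sigma)
    norm_const = 1.0 / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm_const * np.exp(-0.5 * diff @ Sigma_inv @ diff)
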
LDA has linear decision boundaries
By Bayes rule, the probability of category k, given the input x, is:

P(Y = k | X = x) = f_k(x) π_k / P(X = x)

The denominator does not depend on the response k, so we can write it as a constant:

P(Y = k | X = x) = C × f_k(x) π_k

Now, expanding f_k(x):

P(Y = k | X = x) = (C π_k / ((2π)^{p/2} |Σ|^{1/2})) exp(−(1/2)(x − µ_k)^T Σ^{−1} (x − µ_k))

6 / 21
LDA has linear decision boundaries

P(Y = k | X = x) = (C π_k / ((2π)^{p/2} |Σ|^{1/2})) exp(−(1/2)(x − µ_k)^T Σ^{−1} (x − µ_k))

Now, let us absorb everything that does not depend on k into a constant C′:

P(Y = k | X = x) = C′ π_k exp(−(1/2)(x − µ_k)^T Σ^{−1} (x − µ_k))

and take the logarithm of both sides:

log P(Y = k | X = x) = log C′ + log π_k − (1/2)(x − µ_k)^T Σ^{−1} (x − µ_k).

The term log C′ is the same for every category k, so to maximize the left-hand side over k we only need to maximize the remaining terms over k.

7 / 21
LDA has linear decision boundaries

Goal: maximize the following over k:

log π_k − (1/2)(x − µ_k)^T Σ^{−1} (x − µ_k)

  = log π_k − (1/2)(x^T Σ^{−1} x + µ_k^T Σ^{−1} µ_k − 2 x^T Σ^{−1} µ_k)

  = C″ + log π_k − (1/2) µ_k^T Σ^{−1} µ_k + x^T Σ^{−1} µ_k

where C″ = −(1/2) x^T Σ^{−1} x does not depend on k. We define the objective:

δ_k(x) = log π_k − (1/2) µ_k^T Σ^{−1} µ_k + x^T Σ^{−1} µ_k

At an input x, we predict the response with the highest δ_k(x).

8 / 21
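
A sketch of δ_k(x) in Python (here mu_k, Sigma, and pi_k are taken as known, as assumed on this slide; the estimated versions appear later):

import numpy as np

def delta_k(x, mu_k, Sigma, pi_k):
    """LDA discriminant: log pi_k - 0.5 mu_k^T Sigma^{-1} mu_k + x^T Sigma^{-1} mu_k."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.log(pi_k) - 0.5 * mu_k @ Sigma_inv @ mu_k + x @ Sigma_inv @ mu_k

# Prediction at an input x: the class k with the largest delta_k(x), e.g.
# y_hat = max(range(K), key=lambda k: delta_k(x, mu[k], Sigma, pi[k]))
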
LDA has linear decision boundaries

What is the decision boundary? It is the set of points at which two classes do just as well:

δ_k(x) = δ_ℓ(x)

log π_k − (1/2) µ_k^T Σ^{−1} µ_k + x^T Σ^{−1} µ_k = log π_ℓ − (1/2) µ_ℓ^T Σ^{−1} µ_ℓ + x^T Σ^{−1} µ_ℓ

This is a linear equation in x.

[Figure: two panels over the inputs X1 and X2 showing the linear decision boundaries.]

9 / 21
Estimating πk

π̂_k = #{i : y_i = k} / n

In English: the fraction of training samples of class k.

10 / 21
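
In Python, with a hypothetical label vector y of length n:

import numpy as np

y = np.array([0, 1, 1, 2, 0, 1])                     # hypothetical training labels
classes, counts = np.unique(y, return_counts=True)
pi_hat = counts / len(y)                             # fraction of samples in each class
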
Estimating the parameters of f_k(x)
Estimate the center of each class µ_k:

µ̂_k = (1 / #{i : y_i = k}) Σ_{i : y_i = k} x_i

Estimate the common covariance matrix Σ:

- One predictor (p = 1):

  σ̂² = (1 / (n − K)) Σ_{k=1}^{K} Σ_{i : y_i = k} (x_i − µ̂_k)².

- Many predictors (p > 1): Compute the vectors of deviations (x_1 − µ̂_{y_1}), (x_2 − µ̂_{y_2}), ..., (x_n − µ̂_{y_n}) and use an unbiased estimate of their covariance matrix as Σ̂ (a code sketch follows this slide).

11 / 21
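
A sketch of both estimates in Python, assuming a hypothetical design matrix X of shape (n, p) and integer label vector y of length n:

import numpy as np

def estimate_lda_params(X, y):
    """Return the class means mu_hat (K x p) and the pooled covariance Sigma_hat (p x p)."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    # Per-class means mu_hat_k.
    mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])
    # Deviation of each sample from its own class mean.
    dev = X - mu_hat[np.searchsorted(classes, y)]
    # Pooled within-class covariance with the unbiased n - K denominator.
    Sigma_hat = dev.T @ dev / (n - K)
    return mu_hat, Sigma_hat
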
LDA prediction
For an input x, predict the class with the largest:

δ̂_k(x) = log π̂_k − (1/2) µ̂_k^T Σ̂^{−1} µ̂_k + x^T Σ̂^{−1} µ̂_k

The decision boundaries are defined by:

log π̂_k − (1/2) µ̂_k^T Σ̂^{−1} µ̂_k + x^T Σ̂^{−1} µ̂_k = log π̂_ℓ − (1/2) µ̂_ℓ^T Σ̂^{−1} µ̂_ℓ + x^T Σ̂^{−1} µ̂_ℓ

These are the solid lines in:

[Figure: two panels over the inputs X1 and X2 with the estimated decision boundaries drawn as solid lines.]

12 / 21
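
Putting the pieces together, a minimal end-to-end prediction sketch in Python (it reuses the hypothetical estimate_lda_params from the sketch above; class indices refer to the sorted unique labels):

import numpy as np

def lda_predict(X_new, pi_hat, mu_hat, Sigma_hat):
    """Predict, for each row of X_new, the class index with the largest delta_hat_k(x)."""
    Sigma_inv = np.linalg.inv(Sigma_hat)
    # Constant part: log pi_k - 0.5 mu_k^T Sigma^{-1} mu_k, one value per class.
    const = np.log(pi_hat) - 0.5 * np.einsum('kp,pq,kq->k', mu_hat, Sigma_inv, mu_hat)
    # Linear part: x^T Sigma^{-1} mu_k for every input and class.
    scores = X_new @ Sigma_inv @ mu_hat.T + const    # shape (n_new, K)
    return np.argmax(scores, axis=1)
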
Quadratic discriminant analysis (QDA)

The assumption that the inputs of every class have the same
covariance Σ can be quite restrictive:
[Figure: two panels plotting the inputs X1 and X2, illustrating classes whose covariance structures differ.]

13 / 21
Quadratic discriminant analysis (QDA)

In quadratic discriminant analysis we estimate a mean µ̂_k and a covariance matrix Σ̂_k for each class separately.

Given an input, it is easy to derive an objective function:

δ_k(x) = log π_k − (1/2) µ_k^T Σ_k^{−1} µ_k + x^T Σ_k^{−1} µ_k − (1/2) x^T Σ_k^{−1} x − (1/2) log |Σ_k|

This objective is now quadratic in x, and so are the decision boundaries.

14 / 21
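
A sketch of the QDA objective in Python (per-class mu_k, Sigma_k, and pi_k are assumed to be given, e.g. estimated separately for each class):

import numpy as np

def qda_delta_k(x, mu_k, Sigma_k, pi_k):
    """QDA discriminant; quadratic in x because of the -0.5 x^T Sigma_k^{-1} x term."""
    Sigma_inv = np.linalg.inv(Sigma_k)
    return (np.log(pi_k)
            - 0.5 * mu_k @ Sigma_inv @ mu_k
            + x @ Sigma_inv @ mu_k
            - 0.5 * x @ Sigma_inv @ x
            - 0.5 * np.log(np.linalg.det(Sigma_k)))
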
Quadratic discriminant analysis (QDA)

- Bayes boundary (– – –)
- LDA (· · · · · ·)
- QDA (——)

[Figure: two panels over the inputs X1 and X2 comparing the Bayes, LDA, and QDA decision boundaries.]

15 / 21
Evaluating a classification method

We have talked about the 0-1 loss:

(1/m) Σ_{i=1}^{m} 1(y_i ≠ ŷ_i).

It is possible to make the wrong prediction for some classes more often than for others; the 0-1 loss doesn't tell you anything about this.

A much more informative summary of the error is a confusion matrix:

16 / 21
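
For a two-class problem like the default example below, a confusion matrix can be tabulated directly; a sketch in Python with hypothetical 0/1 label arrays:

import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows: true class (0 = no, 1 = yes). Columns: predicted class."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
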
Example. Predicting default
Used LDA to predict credit card default in a dataset of 10K people.

Predicted "yes" if P(default = yes | X) > 0.5.

- The error rate among people who do not default (the false positive rate) is very low.
- However, the rate of false negatives is 76%.
- It is possible that false negatives are a bigger source of concern!
- One possible solution: change the threshold.
17 / 21
Example. Predicting default

Changing the threshold to 0.2 makes it easier to classify as "yes".

Predicted "yes" if P(default = yes | X) > 0.2.

Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.

18 / 21
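
Changing the threshold only changes the final comparison against the estimated posterior; a sketch in Python (p_default is a hypothetical vector of P̂(default = yes | X)):

import numpy as np

p_default = np.array([0.03, 0.15, 0.45, 0.72])   # hypothetical posteriors
pred_at_05 = (p_default > 0.5).astype(int)       # default threshold -> [0, 0, 0, 1]
pred_at_02 = (p_default > 0.2).astype(int)       # lower threshold   -> [0, 0, 1, 1]
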
Example. Predicting default

Let’s visualize the dependence of the error on the threshold:

[Figure: error rate (vertical axis, 0.0 to 0.6) as a function of the threshold (horizontal axis, 0.0 to 0.5).]

- – – – False negative rate (error for defaulting customers)
- · · · · · False positive rate (error for non-defaulting customers)
- —— 0-1 loss, or total error rate.

19 / 21
Example. The ROC curve

[Figure: ROC curve, plotting the true positive rate against the false positive rate.]

- Displays the performance of the method for any choice of threshold.
- The area under the curve (AUC) measures the quality of the classifier:
  - 0.5 is the AUC for a random classifier.
  - The closer the AUC is to 1, the better.

20 / 21
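
One way to trace the ROC curve is to sweep the threshold over the estimated posteriors; a sketch in Python (scikit-learn's roc_curve and roc_auc_score compute the same quantities):

import numpy as np

def roc_points(y_true, scores):
    """False and true positive rates as the threshold sweeps over the scores."""
    thresholds = np.unique(scores)[::-1]
    pos = (y_true == 1).sum()
    neg = (y_true == 0).sum()
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t
        tpr.append((pred & (y_true == 1)).sum() / pos)
        fpr.append((pred & (y_true == 0)).sum() / neg)
    return np.array(fpr), np.array(tpr)

# AUC by the trapezoidal rule over the swept points:
# fpr, tpr = roc_points(y_true, scores); auc = np.trapz(tpr, fpr)
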
Next time

- Comparison of logistic regression, LDA, QDA, and KNN classification.
- Start Chapter 5: Resampling.

21 / 21
