Week #5
FACULTY OF ENGINEERING
CAIRO UNIVERSITY
Contents
1. Linear Discriminant Analysis (LDA)
2. Considerations in classification
3. Quadratic Discriminant Analysis (QDA)
• Logistic Regression:
• Doesn't make assumptions about the distribution of the input variables.
• It estimates the probability that an instance belongs to a particular class based on a linear combination of the input features.
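As a quick, hedged illustration of this recap, here is a minimal scikit-learn sketch on synthetic data (the features and labels below are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # two illustrative input features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic binary labels

model = LogisticRegression().fit(X, y)
# predict_proba returns class probabilities, each driven by a
# linear combination of the input features
print(model.predict_proba(X[:3]))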
1. LDA. P=1
• We would like to obtain an estimate for $f_k(x)$ that we can plug into Bayes' theorem,
$$p_k(x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)},$$
in order to estimate $p_k(x)$.
1. LDA. P=1
• In order to estimate $f_k(x)$, we will first make some assumptions about its form. Suppose a normal (Gaussian) form for $f_k(x)$:
$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)$$
• $\mu_k$ and $\sigma_k$ are the mean and standard deviation parameters for the $k$-th class. Let's also assume that all classes share the same variance $\sigma^2$; then:
$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu_k)^2}{2\sigma^2}\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu_l)^2}{2\sigma^2}\right)}$$
Remember: $\pi_k$ represents the overall or prior probability that a randomly chosen observation comes from the $k$-th class.
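A small numeric sketch of this posterior, assuming two classes with invented parameters ($\mu_1 = -1$, $\mu_2 = 1$, shared $\sigma = 1$, equal priors):

import numpy as np
from scipy.stats import norm

mus = np.array([-1.0, 1.0])        # assumed class means
sigma = 1.0                        # shared standard deviation
priors = np.array([0.5, 0.5])      # prior probabilities pi_k

x = 0.3                                    # observation to classify
f = norm.pdf(x, loc=mus, scale=sigma)      # class densities f_k(x)
p = priors * f / np.sum(priors * f)        # posteriors p_k(x) via Bayes' theorem
print(p)                                   # approx. [0.354, 0.646] -> class 2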
1. LDA. P=1
• The Bayes classifier involves assigning an observation X = x to the class for which $p_k(x)$ is largest.
• By taking the log and rearranging terms, this is the same as assigning the observation to the class for which the following discriminant is largest:
$$\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$
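To make the rearrangement explicit (a standard step; terms common to all classes are dropped): since the denominator of $p_k(x)$ is the same for every class, maximizing $p_k(x)$ is equivalent to maximizing $\log(\pi_k f_k(x))$, and
$$\log(\pi_k f_k(x)) = \log \pi_k - \log(\sqrt{2\pi}\,\sigma) - \frac{x^2}{2\sigma^2} + \frac{x\,\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}.$$
The second and third terms do not depend on $k$, so they can be discarded, leaving $\delta_k(x)$.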
1. LDA. P=1
• For instance, if K = 2 and $\pi_1 = \pi_2$, then the Bayes classifier assigns an observation to class 1 if $2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$, and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point where:
$$x = \frac{\mu_1 + \mu_2}{2}$$
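A quick numeric check of this boundary, reusing the invented parameters from above ($\mu_1 = -1$, $\mu_2 = 1$, $\sigma^2 = 1$, equal priors):

import numpy as np

mu1, mu2, sigma2 = -1.0, 1.0, 1.0    # assumed class means and shared variance
log_prior = np.log(0.5)              # equal priors: this term is the same for both

def delta(x, mu):
    # one-dimensional linear discriminant for a single class
    return x * mu / sigma2 - mu**2 / (2 * sigma2) + log_prior

xs = np.linspace(-2, 2, 5)
# the sign flips exactly at x = (mu1 + mu2) / 2 = 0
print(delta(xs, mu1) - delta(xs, mu2))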
1. LDA. P=1
• In practice, even if we are quite certain of our assumption that X is drawn from a Gaussian distribution within each class, we still have to estimate the parameters $\mu_1, \ldots, \mu_K$, $\pi_1, \ldots, \pi_K$, and $\sigma^2$.
• The LDA method approximates the Bayes classifier by plugging estimates for $\mu_k$, $\pi_k$, and $\sigma^2$ into $\delta_k(x)$. In particular, the following estimates are used:
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i:\,y_i = k} x_i \qquad \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\,y_i = k} (x_i - \hat{\mu}_k)^2$$
where $n$ is the total number of training observations and $n_k$ is the number of training observations in the $k$-th class.
1. LDA. P=1
• Sometimes we have knowledge of the class membership probabilities $\pi_1, \ldots, \pi_K$, which can be used directly.
• In the absence of any additional information, LDA estimates $\pi_k$ using the proportion of the training observations that belong to the $k$-th class. In other words:
$$\hat{\pi}_k = \frac{n_k}{n}$$
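Putting the three estimates together, a minimal sketch of fitting LDA's one-dimensional parameters from synthetic training data (all names and values below are illustrative):

import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1, 1, 60), rng.normal(1, 1, 40)])      # predictor
y = np.concatenate([np.zeros(60, dtype=int), np.ones(40, dtype=int)])  # labels

n, K = len(x), 2
pi_hat = np.array([np.mean(y == k) for k in range(K)])      # n_k / n
mu_hat = np.array([x[y == k].mean() for k in range(K)])     # class means
# pooled variance: within-class squared deviations, divided by n - K
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)
print(pi_hat, mu_hat, sigma2_hat)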
1. LDA. P>1
• We now extend the LDA classifier to the case of multiple predictors.
• To do this, we will assume that the predictors are drawn from a multivariate normal distribution, with a class-specific mean vector and a common covariance matrix.
1. LDA. P>1
• The multivariate normal distribution assumes that each individual predictor follows a one-dimensional normal distribution, while allowing for some correlation between each pair of predictors.
1. LDA. P>1
• To indicate that a p-dimensional random variable X has a multivariate Gaussian distribution, we write $X \sim N(\mu, \Sigma)$.
• Here $E(X) = \mu$ is the mean of X (a vector with p components), and $\mathrm{Cov}(X) = \Sigma$ is the p × p covariance matrix of X.
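For intuition, a short sketch that draws samples from an assumed bivariate normal and checks that the sample statistics recover $\mu$ and $\Sigma$ (both values below are invented):

import numpy as np

mu = np.array([0.0, 2.0])              # assumed mean vector (p = 2)
Sigma = np.array([[1.0, 0.6],          # assumed covariance matrix;
                  [0.6, 2.0]])         # off-diagonals encode correlation

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mu, Sigma, size=5000)
print(X.mean(axis=0))                  # close to mu
print(np.cov(X, rowvar=False))         # close to Sigma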
1. LDA. P>1
• In the case of p > 1 predictors, the LDA classifier assumes that the observations in the $k$-th class are drawn from a multivariate Gaussian distribution $N(\mu_k, \Sigma)$, where $\mu_k$ is a class-specific mean vector, and $\Sigma$ is a covariance matrix that is common to all K classes.
• The Bayes classifier assigns an observation X = x to the class for which the discriminant function is largest:
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\,\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$
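A direct NumPy translation of this discriminant (means, covariance, and priors are invented for illustration):

import numpy as np

mus = np.array([[0.0, 0.0], [2.0, 1.0]])   # class-specific mean vectors
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])             # common covariance matrix
priors = np.array([0.7, 0.3])
Sigma_inv = np.linalg.inv(Sigma)

def delta(x):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
    return np.array([x @ Sigma_inv @ m - 0.5 * (m @ Sigma_inv @ m) + np.log(p)
                     for m, p in zip(mus, priors)])

x = np.array([1.0, 0.5])
scores = delta(x)
print(scores, "-> class", scores.argmax())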
2. Considerations in classification
• We can perform LDA on the Default data in order to predict whether an
individual will default on the basis of credit card balance and student status.
• After fitting, the training error rate is 2.75%. This sounds like a low error rate, but two cautions must be noted:
1. Training error rates will usually be lower than test error rates. The higher the ratio of
parameters p to number of samples n, the more we expect this overfitting to play a
role.
2. Since only 3.33% of the individuals in the training sample defaulted, a useless classifier that always predicts that an individual will not default, regardless of their credit balance, will achieve an error rate of 3.33%, only a bit higher than the LDA training error rate. (This is the class imbalance problem; see the sketch below.)
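A hedged sketch of both cautions on synthetic, imbalanced data (scikit-learn does not ship the Default dataset, so the data below only mimics its roughly 3.33% default rate):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n_no, n_yes = 9667, 333                            # ~3.33% positive class
X = np.vstack([rng.normal(0.5, 0.2, (n_no, 1)),    # balances of non-defaulters
               rng.normal(1.8, 0.3, (n_yes, 1))])  # defaulters: higher balances
y = np.array([0] * n_no + [1] * n_yes)

lda = LinearDiscriminantAnalysis().fit(X, y)
print("LDA training error:", 1 - lda.score(X, y))
print("always-'no' error: ", np.mean(y != 0))      # the useless baseline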
2. Considerations. Threshold
Why does LDA do such a poor job of classifying the customers who default?
• LDA is trying to approximate the Bayes classifier, which has the lowest total
error rate out of all classifiers (if the Gaussian model is correct).
• That is, the Bayes classifier will yield the smallest possible total number of
misclassified observations, irrespective of which class the errors come
from.
2. Considerations. Threshold
Why does LDA do such a poor job of classifying the customers who default?
• We will now see that it is possible to modify LDA in order to develop a
classifier that better meets the credit card company’s needs.
• The Bayes classifier works by assigning an observation to the class for which the posterior probability $p_k(X)$ is greatest. In the two-class case, this amounts to assigning an observation to the default class if
$$\Pr(\text{default} = \text{Yes} \mid X = x) > 0.5$$
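Lowering this threshold trades total error for better detection of the rare class. A self-contained sketch on the same synthetic, imbalanced data as above (the 0.2 threshold is an illustrative choice, not a rule):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n_no, n_yes = 9667, 333
X = np.vstack([rng.normal(0.5, 0.2, (n_no, 1)),
               rng.normal(1.8, 0.3, (n_yes, 1))])
y = np.array([0] * n_no + [1] * n_yes)
lda = LinearDiscriminantAnalysis().fit(X, y)

probs = lda.predict_proba(X)[:, 1]        # posterior Pr(default = Yes | X = x)
for threshold in (0.5, 0.2):              # Bayes default vs. a lowered threshold
    pred = (probs > threshold).astype(int)
    missed = (pred[y == 1] == 0).mean()   # share of defaulters not flagged
    alarms = (pred[y == 0] == 1).mean()   # share of non-defaulters flagged
    print(threshold, "missed:", missed, "false alarms:", alarms)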