
FACULTY OF ENGINEERING
CAIRO UNIVERSITY

Machine Learning for Industrial Engineering


Week#5: Classification 2
Ahmed Sakr, PhD, Eng.
2024-02-17

Contents
1. Linear Discriminant Analysis (LDA)
2. Considerations in classification
3. Quadratic Discriminant Analysis (QDA)

1. Linear Discriminant Analysis (LDA)


• Logistic regression involves directly modeling Pr(Y = k|X = x) using the logistic function.
• We now consider an alternative and less direct approach to estimating these probabilities.
• In this alternative approach, we model the distribution of the predictors X separately in each of the response classes (i.e., given Y), and then use Bayes' theorem to flip these around into estimates for Pr(Y = k|X = x).

1. LDA. Why not logistic?


Why do we need another method, when we have logistic regression?
1. When the classes are well separated, the parameter estimates for the logistic regression model are unstable.
2. If n is small, and X is approximately normal in each of the classes, LDA is more stable than logistic regression.
3. LDA is popular when there are more than two response classes.

1. LDA. Why not logistic?


• Linear Discriminant Analysis (LDA):
• Assumes the input variables have a normal distribution within each class.
• It seeks to find linear combinations of features that best separate different classes
while minimizing the variance within each class.

• Logistic Regression:
• Doesn't make assumptions about the distribution of the input variables.
• It estimates the probability that an instance belongs to a particular class based on a linear combination of the input features (see the sketch below).
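
As an illustration (a minimal sketch added here, not part of the original slides), the following Python code fits both classifiers with scikit-learn; the synthetic dataset and every parameter choice are assumptions for demonstration only.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data (an assumption; any tabular dataset would do).
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearDiscriminantAnalysis(), LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)                               # fit on training split
    print(type(model).__name__, model.score(X_test, y_test))  # test accuracy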

1. LDA. Why not logistic?


• If the sample size is small and the distribution of the input variables is approximately normal within each class, LDA is considered more stable than logistic regression.
• This is because LDA makes specific assumptions about the normal distribution of input variables, and when these assumptions hold, it can perform well even with limited data.
• Logistic regression, being more flexible in terms of distribution assumptions, might be prone to overfitting or instability when the sample size is small.

1. LDA. P=1
• We would like to obtain an estimate for f_k(x) that we can plug into

  p_k(x) = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x)

in order to estimate p_k(x).
• We will then classify an observation to the class for which p_k(x) is greatest.
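
To make this concrete, here is a tiny Python sketch (an added illustration; the priors and density values are made-up numbers) that plugs assumed π_k and f_k(x) into the formula above and picks the class with the largest posterior.

import numpy as np

priors = np.array([0.3, 0.7])     # assumed priors pi_k (made-up numbers)
densities = np.array([0.8, 0.1])  # assumed f_k(x) values at one point x

posteriors = priors * densities / np.sum(priors * densities)  # p_k(x)
print(posteriors)             # [0.774... 0.225...]
print(np.argmax(posteriors))  # classify x to the class with the largest p_k(x)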

1. LDA. P=1
• In order to estimate f_k(x), we will first make some assumptions about its form. Suppose a normal (Gaussian) form for f_k(x):

  f_k(x) = (1 / (√(2π) σ_k)) exp(−(x − μ_k)² / (2σ_k²))

• μ_k and σ_k are the mean and standard deviation parameters for the k-th class. Let's also assume that all classes share the same variance (σ²); then:

  p_k(x) = π_k exp(−(x − μ_k)² / (2σ²)) / Σ_{l=1}^{K} π_l exp(−(x − μ_l)² / (2σ²))

Remember: π_k represents the overall or prior probability that a randomly chosen observation comes from the k-th class.

1. LDA. P=1
• The Bayes classifier involves assigning an observation X = x to the class for which p_k(x) is largest.
• By taking the log and rearranging terms, this is equivalent to assigning the observation to the class for which the following is largest:

  δ_k(x) = x · (μ_k/σ²) − μ_k²/(2σ²) + log(π_k)

• δ_k(x) is called the discriminant function.
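
A minimal sketch (an illustration added here; the class means, variance, and priors are assumed values) of classifying a point with these one-dimensional discriminant functions:

import numpy as np

mu = np.array([0.0, 2.0])      # assumed class means mu_k
sigma2 = 1.0                   # assumed common variance sigma^2
priors = np.array([0.5, 0.5])  # assumed priors pi_k

def delta(x):
    # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(priors)

x = 0.8
print(np.argmax(delta(x)))  # 0: with equal priors the boundary is (mu_1 + mu_2)/2 = 1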

1. LDA. P=1
• For instance, if K = 2 and π₁ = π₂, then the Bayes classifier assigns an observation to class 1 if 2x(μ₁ − μ₂) > μ₁² − μ₂², and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point where

  x = (μ₁ + μ₂)/2

1. LDA. P=1
• In practice, even if we are quite certain of our assumption that X is drawn from a Gaussian distribution within each class, we still have to estimate the parameters μ₁, ..., μ_K, π₁, ..., π_K, and σ².
• The LDA method approximates the Bayes classifier by plugging estimates for μ_k, π_k, and σ² into δ_k(x). In particular, the following estimates are used:

  μ̂_k = (1/n_k) Σ_{i: y_i = k} x_i

  σ̂² = (1/(n − K)) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)²

where n is the total number of training observations, and n_k is the number of training observations in the k-th class.
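
These estimates are straightforward to compute; here is a small Python sketch (added for illustration, with made-up one-dimensional data):

import numpy as np

x = np.array([0.1, -0.2, 0.3, 1.9, 2.2, 2.1])  # made-up training data
y = np.array([0, 0, 0, 1, 1, 1])               # class labels 0, ..., K-1
K, n = 2, len(x)

mu_hat = np.array([x[y == k].mean() for k in range(K)])   # class means
pi_hat = np.array([(y == k).mean() for k in range(K)])    # n_k / n
sigma2_hat = sum(((x[y == k] - mu_hat[k])**2).sum()
                 for k in range(K)) / (n - K)             # pooled variance
print(mu_hat, pi_hat, sigma2_hat)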

1. LDA. P=1
• Sometimes we have knowledge of the class membership probabilities π₁, ..., π_K, which can be used directly.
• In the absence of any additional information, LDA estimates π_k using the proportion of the training observations that belong to the k-th class. In other words:

  π̂_k = n_k / n

1. LDA. P>1
• We now extend the LDA classifier to the case of multiple predictors.

• To do this, we will assume that the predictors are drawn from a multivariate normal distribution, with a class-specific mean vector and a common covariance matrix.

1. LDA. P>1
• The multivariate normal distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.

1. LDA. P>1
• To indicate that a p-dimensional random variable X has a multivariate Gaussian distribution, we write X ∼ N(μ, Σ).
• Here E(X) = μ is the mean of X (a vector with p components), and Cov(X) = Σ is the p × p covariance matrix of X.
• The multivariate Gaussian density is defined as

  f(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))

1. LDA. P>1
• In the case of p > 1 predictors, the LDA classifier assumes that the observations in the k-th class are drawn from a multivariate Gaussian distribution N(μ_k, Σ), where μ_k is a class-specific mean vector, and Σ is a covariance matrix that is common to all K classes.
• The Bayes classifier assigns an observation X = x to the class for which the discriminant function is largest:

  δ_k(x) = xᵀ Σ⁻¹ μ_k − ½ μ_kᵀ Σ⁻¹ μ_k + log(π_k)

• Once again, we need to estimate the unknown parameters μ₁, ..., μ_K, π₁, ..., π_K, and Σ; the formulas are similar to those used in the one-dimensional case.
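
A minimal numpy sketch (added for illustration; the means, covariance, and priors are assumed values) of the multivariate discriminant function:

import numpy as np

mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]  # assumed class means mu_k
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])         # assumed common covariance
priors = np.array([0.5, 0.5])                      # assumed priors pi_k
Sigma_inv = np.linalg.inv(Sigma)

def delta(x, k):
    # delta_k(x) = x^T Sigma^-1 mu_k - 0.5 mu_k^T Sigma^-1 mu_k + log(pi_k)
    return x @ Sigma_inv @ mu[k] - 0.5 * mu[k] @ Sigma_inv @ mu[k] + np.log(priors[k])

x = np.array([1.5, 0.5])
print(max(range(2), key=lambda k: delta(x, k)))  # index of the predicted class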

2. Considerations in classification
• We can perform LDA on the Default data in order to predict whether an individual will default on the basis of credit card balance and student status.
• After fitting, the training error rate is 2.75%. This sounds like a low error rate, but two cautions must be noted:
1. Training error rates will usually be lower than test error rates. The higher the ratio of the number of parameters p to the number of samples n, the more we expect this overfitting to play a role.
2. Since only 3.33% of the individuals in the training sample defaulted, a useless classifier that always predicts that an individual will not default, regardless of his credit, would achieve an error rate of 3.33%, only a bit higher than the LDA training error rate (the class imbalance problem).

2. Considerations. Confusion matrix


• It is often of interest to determine which of the two types of errors (false positives and false negatives) are being made. For LDA on the Default training data:

                    True No   True Yes    Total
  Predicted No        9,644        252    9,896
  Predicted Yes          23         81      104
  Total               9,667        333   10,000

2. Considerations. Confusion matrix


• Hence only 23 out of 9,667 of the individuals who did not default were incorrectly labeled.
• This looks like a pretty low error rate! However, of the 333 individuals
who defaulted, 252 (or 75.7%) were missed by LDA.
• So, while the overall error rate is low, the error rate among individuals
who defaulted is very high.

2. Considerations. Confusion matrix


• From the perspective of a credit card company, an error rate of 75.7% among defaulters may well be unacceptable.
• Class-specific performance is also important in medicine and biology, where the terms sensitivity and specificity characterize the performance of a classifier or screening test.
• In this case the sensitivity is the percentage of true defaulters that are identified: a low 24.3% (81/333).
• The specificity is the percentage of non-defaulters that are correctly identified: here (1 − 23/9,667) × 100 = 99.8%.

2. Considerations. Confusion matrix, classes > 2


• True Positives (TP): These are the cases where
the model correctly predicts positive instances
of a certain class.
• True Negatives (TN): These are the cases where
the model correctly predicts negative instances
of a certain class.
• False Positives (FP): These are the cases where
the model incorrectly predicts positive instances
when the actual label is negative with respect to
each class.
• False Negatives (FN): These are the cases where
the model incorrectly predicts negative
instances when the actual label is positive with
respect to each class.

2. Considerations. Confusion matrix, classes > 2


• Sensitivity (True Positive Rate): TP/(TP + FN)
• A higher sensitivity indicates a lower rate of false negatives, indicating that the model misses few positive instances.
• Example: if a model yields 100% sensitivity for class 4, it means that 100% of the instances that actually belong to class 4 are predicted as class 4.

• Specificity (True Negative Rate): TN/(TN + FP)
• A higher specificity indicates a lower rate of false positives, indicating that the model rarely mislabels negative instances as positive.
• Example: if a model yields 100% specificity for class 4, it means that 100% of the instances that do not belong to class 4 are predicted as not belonging to class 4.

2. Considerations. Confusion matrix, classes > 2


• Accuracy: measures the overall correctness of the model's predictions:

  Accuracy = (TP + TN) / (total number of instances)
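
These metrics are simple to compute; the following Python sketch (added here) uses the counts from the Default confusion matrix above, treating defaulters as the positive class:

TP, FN = 81, 252    # defaulters correctly identified / missed
TN, FP = 9644, 23   # non-defaulters correctly / incorrectly labeled

sensitivity = TP / (TP + FN)                 # 81/333 = 0.243
specificity = TN / (TN + FP)                 # 9644/9667 = 0.998
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 9725/10000 = 0.9725 (2.75% error)
print(sensitivity, specificity, accuracy)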

2. Considerations. Threshold
Why does LDA do such a poor job of classifying the customers who default?
• LDA is trying to approximate the Bayes classifier, which has the lowest total
error rate out of all classifiers (if the Gaussian model is correct).
• That is, the Bayes classifier will yield the smallest possible total number of
misclassified observations, irrespective of which class the errors come
from.

2. Considerations. Threshold
Why does LDA do such a poor job of classifying the customers who default?
• We will now see that it is possible to modify LDA in order to develop a classifier that better meets the credit card company's needs.
• The Bayes classifier works by assigning an observation to the class for which the posterior probability p_k(X) is greatest. In the two-class case, this amounts to assigning an observation to the default class if

  Pr(default = Yes | X = x) > 0.5

2. Considerations. Lowering the threshold


Why does LDA do such a poor job of classifying the customers who default?
• So LDA uses a threshold of 50% for the posterior probability of default to assign an observation to the default class.
• However, if we are concerned about incorrectly predicting the default status for individuals who default, then we can consider lowering this threshold, for instance to 20%:

  Pr(default = Yes | X = x) > 0.2
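
A minimal scikit-learn sketch (added for illustration; the synthetic imbalanced data below merely stands in for the Default data) of predicting with a lowered threshold:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for Default (y = 1 means "default").
X, y = make_classification(n_samples=10000, weights=[0.97], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
posterior = lda.predict_proba(X_test)[:, 1]  # Pr(y = 1 | X = x)

y_pred_50 = (posterior > 0.5).astype(int)    # standard 50% threshold
y_pred_20 = (posterior > 0.2).astype(int)    # lowered threshold flags more positives
print(y_pred_50.sum(), y_pred_20.sum())      # number predicted to default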

2. Considerations. Lowering the threshold


Why does LDA do such a poor job of classifying the customers who default?
• With the lowered threshold, LDA now predicts that 430 individuals will default. Of the 333 individuals who default, LDA correctly predicts all but 138, an error rate of 41.4% among defaulters.
• This is a vast improvement over the error rate of 75.7% that resulted from
using the threshold of 50%.

2. Considerations. Lowering the threshold


Why does LDA do such a poor job of classifying the customers who default?
• However, this improvement comes at a cost: now 235 individuals who do
not default are incorrectly classified.
• As a result, the overall error rate has increased slightly, to 3.73%.
• But a credit card company may consider this slight increase in the total
error rate to be a small price to pay for more accurate identification of
individuals who do indeed default.


2. Considerations. ROC curve


The Receiver Operating Characteristic
(ROC) curve is a graphical
representation used to assess the
performance of a binary classification
model across different threshold
settings. It provides a comprehensive
view of the trade-off between
sensitivity and specificity for different
decision thresholds.
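
A short sketch (added here, reusing the synthetic stand-in data from the threshold example) tracing the ROC curve with scikit-learn:

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.97], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
posterior = lda.predict_proba(X_test)[:, 1]

# One (fpr, tpr) point per threshold: fpr = 1 - specificity, tpr = sensitivity.
fpr, tpr, thresholds = roc_curve(y_test, posterior)
print(roc_auc_score(y_test, posterior))  # area under the ROC curve (AUC)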

3. Quadratic Discriminant Analysis (QDA)


• Assumption about Covariance:
• LDA: Assumes that the covariance of the features is the same for all classes,
meaning there is a common covariance matrix.
• QDA: Allows for different covariances for each class, resulting in a separate
covariance matrix for each class.
• Decision Boundary:
• LDA: Assumes a linear decision boundary. It is computationally simpler and may
perform well when the linear assumption holds.
• QDA: Allows for a quadratic decision boundary, providing more flexibility. It can
capture more complex relationships between features.

3. Quadratic Discriminant Analysis (QDA)


• Sample Size Consideration:
• LDA: Generally performs better when the sample size is small relative to the number of features, since only a single shared covariance matrix must be estimated.
• QDA: May perform well with a larger sample size but can be computationally demanding when the number of features is high.
• Computational Complexity:
• LDA: Typically computationally simpler because it involves estimating a common covariance
matrix.
• QDA: Involves estimating separate covariance matrices for each class, making it
computationally more intensive.
• Performance in Practice:
• LDA: Often performs well in practice, especially when the assumptions about the
covariance structure hold.
• QDA: Can capture more complex relationships and may outperform LDA when the features'
covariance structures differ significantly across classes.
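
A minimal numpy sketch (added for illustration; all means, covariances, and priors are assumed values) of the quadratic discriminant, which keeps a separate covariance matrix Σ_k for each class:

import numpy as np

mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # assumed class means
Sigma = [np.array([[1.0, 0.2], [0.2, 1.0]]),        # class-specific covariances,
         np.array([[0.5, 0.0], [0.0, 2.0]])]        # unlike LDA's shared Sigma
priors = np.array([0.5, 0.5])

def delta_qda(x, k):
    # delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x - mu_k)^T Sigma_k^-1 (x - mu_k) + log(pi_k)
    d = x - mu[k]
    return (-0.5 * np.log(np.linalg.det(Sigma[k]))
            - 0.5 * d @ np.linalg.inv(Sigma[k]) @ d
            + np.log(priors[k]))

x = np.array([1.0, 0.5])
print(max(range(2), key=lambda k: delta_qda(x, k)))  # predicted class index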
