Pattern Recognition 21BR551 MODULE 02 NOTES
Pattern Recognition
(21BR551)
MODULE 02 NOTES
Prepared by: Mr Thanmay J S, Assistant Professor, Bio-Medical & Robotics Engineering, UoM, SoE, Mysore 570006
Mysore University School of Engineering
8J99+QC7, Manasa Gangothiri, Mysuru, Karnataka 570006
2.0 Introduction to Bayesian Classifiers
Bayesian classifiers are statistical classification models that use Bayes' Theorem to compute the
probability of each class label given the input features. The key idea is to model the relationship between
the features of the data and the target class using probabilistic reasoning. The general framework uses
Bayes' Theorem to estimate the posterior probability of every class, and the input is then assigned to the
class with the highest posterior probability.
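Bayes' Theorem states that:
P(C | x) = P(x | C) P(C) / P(x)
where P(C | x) is the posterior probability of class C given the features x, P(x | C) is the likelihood of the
features under class C, P(C) is the prior probability of the class, and P(x) is the evidence (the total
probability of x over all classes).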
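The decision boundary between two classes C1 and C2 is the set of points x that satisfy
P(C1 | x) = P(C2 | x)
This equation determines the points where the posterior probabilities of the two classes are equal, and
thus the boundary. For simple synthetic examples, such as two Gaussian classes with different means, the
decision boundary is linear when the classes share the same covariance and quadratic when their
covariances differ.
For multidimensional distributions, the boundary condition can be written in matrix form using the class
mean vectors and covariance matrices. This typically results in a quadratic decision boundary.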
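As a sketch of the matrix form mentioned above, assume each class C_i is modelled by a D-dimensional Gaussian with mean vector μ_i, covariance matrix Σ_i, and prior P(C_i). These symbols are chosen here for illustration, and the constant term common to all classes is dropped. Under that assumption, the log-discriminant of each class and the boundary condition can be written as:

```latex
g_i(\mathbf{x}) = -\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}_i)^{\top}\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)
                  \;-\; \tfrac{1}{2}\ln\lvert\boldsymbol{\Sigma}_i\rvert \;+\; \ln P(C_i)
\\[4pt]
\text{Decision boundary:}\quad g_1(\mathbf{x}) = g_2(\mathbf{x})
```

The term that is quadratic in x cancels only when Σ_1 = Σ_2, which is why equal covariances give a linear boundary and unequal covariances give a quadratic one.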
Error rates are often used to evaluate the effectiveness of a classifier and to tune model parameters (e.g.,
regularization). The error rate can also be estimated through cross-validation, where the data is split into
multiple subsets (folds): the model is trained on all folds except one and validated on the held-out fold, and
the procedure is repeated so that each fold is held out once. Averaging the results provides a more robust
estimate of the error rate, especially when the data is limited.
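The cross-validation procedure can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the nearest-class-mean classifier is a stand-in chosen for brevity, folds are assigned by index (i mod k) as an assumption, and the data are the six income/purchase records from the worked example in Section 2.6.

```python
# Six (income, purchased) records from the Section 2.6 example
data = [(30, "Yes"), (50, "No"), (40, "Yes"), (60, "No"), (55, "No"), (45, "Yes")]

def nearest_mean_predict(train, x):
    """Stand-in classifier: predict the class whose training mean is closest to x."""
    by_class = {}
    for value, label in train:
        by_class.setdefault(label, []).append(value)
    means = {label: sum(vals) / len(vals) for label, vals in by_class.items()}
    return min(means, key=lambda label: abs(x - means[label]))

def k_fold_error_rate(data, k):
    """Hold out each fold once, train on the rest, and pool the errors."""
    errors = 0
    for fold in range(k):
        test = [row for i, row in enumerate(data) if i % k == fold]
        train = [row for i, row in enumerate(data) if i % k != fold]
        errors += sum(1 for x, y in test if nearest_mean_predict(train, x) != y)
    return errors / len(data)

cv_error = k_fold_error_rate(data, k=3)
print(f"3-fold cross-validated error rate: {cv_error:.3f}")
```

Every record is used for validation exactly once, so the estimate is based on all of the data even though no record is ever scored by a model that saw it during training.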
This allows for a more nuanced evaluation of classifier performance when the costs of different errors vary.
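One common way to account for unequal costs is to weight each type of error by its cost before averaging. The sketch below is illustrative only: the cost values are hypothetical (a false negative is assumed to cost five times as much as a false positive), the function name is made up for this example, and the error counts are borrowed from the spam confusion matrix in the practice problems.

```python
def average_cost(fp, fn, total, cost_fp=1.0, cost_fn=1.0):
    """Average misclassification cost per sample, given per-error costs."""
    return (cost_fp * fp + cost_fn * fn) / total

# Error counts from the spam confusion matrix: 50 false positives, 50 false negatives
fp, fn, total = 50, 50, 1000

# With equal unit costs this reduces to the ordinary error rate
plain_error_rate = average_cost(fp, fn, total)

# Hypothetical costs: missing a spam email (FN) is 5x worse than a false alarm (FP)
weighted_cost = average_cost(fp, fn, total, cost_fp=1.0, cost_fn=5.0)

print(plain_error_rate, weighted_cost)
```

Two classifiers with the same error rate can therefore have very different average costs if their errors are of different types.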
2.6 Model-Based Estimates
In model-based estimation, the error rates are computed based on the model's assumptions. For example, in a
Gaussian Naive Bayes model, you may estimate the error rate using the likelihood of observing each class's
features given the model parameters.
Model-based estimates refer to predictions or estimates generated through a statistical or computational model,
which uses existing data and assumptions to predict an outcome or future value. These models can range from
simple linear regression to complex machine learning algorithms, depending on the problem at hand.
Example Problem on Model-Based Estimates (Bayesian Classification)
Problem Statement: You are working with a model to predict whether a person will buy a product based on
their income. You are given the following training data:
Customer    Income (in thousands)    Purchased (Class)
1           30                       Yes
2           50                       No
3           40                       Yes
4           60                       No
5           55                       No
6           45                       Yes
Now, you are asked to predict if a new customer, who has an income of 47 (thousand), will purchase the
product or not. We will use Bayesian classification to solve this problem, assuming that the income feature
follows a normal distribution for each class ("Yes" and "No").
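Step 1: Calculate Prior Probabilities
First, calculate the prior probabilities for the "Yes" and "No" classes, which are based on the frequency of
each class in the dataset. Three of the six customers purchased and three did not, so
P(Yes) = 3/6 = 0.5 and P(No) = 3/6 = 0.5.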
Step 5: Conclusion
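Since 𝑃(𝑌𝑒𝑠 ∣ 47) = 0.030 is greater than 𝑃(𝑁𝑜 ∣ 47) = 0.0265, we predict that the new customer will
purchase the product. Note that these two values do not sum to 1, so they are unnormalized scores rather
than true posterior probabilities; normalizing them does not change which class is larger. This is a simple
application of Model-Based Estimation (Bayesian classification) to classify a new customer based on their
income.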
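The whole procedure for this example can be sketched in Python as below. This is a minimal sketch under stated assumptions: each class is modelled by a one-dimensional Gaussian, and the standard deviation uses the sample (n - 1) estimator. The intermediate numbers depend on that choice, so they need not match the values quoted in Step 5, but the predicted class is the same. The function names are illustrative.

```python
import math
import statistics

# Training data from the table above: (income in thousands, purchased)
data = [(30, "Yes"), (50, "No"), (40, "Yes"), (60, "No"), (55, "No"), (45, "Yes")]

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def classify(x, data):
    """Return (predicted label, normalized posteriors) for a 1-D Gaussian Bayes model."""
    scores = {}
    for label in sorted({lab for _, lab in data}):
        values = [v for v, lab in data if lab == label]
        prior = len(values) / len(data)          # class frequency
        mean = statistics.mean(values)
        std = statistics.stdev(values)           # assumption: sample (n - 1) estimator
        scores[label] = prior * gaussian_pdf(x, mean, std)
    evidence = sum(scores.values())              # P(x), used to normalize
    posteriors = {label: s / evidence for label, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

label, posteriors = classify(47, data)
print(label, posteriors)
```

Dividing by the evidence makes the two posteriors sum to 1, which is convenient for reporting but does not affect the decision.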
Steps to Calculate Error Rate:
1. Count the total number of observations (emails):
o Total emails = 10
2. Count the number of errors:
o Errors occur when the predicted label does not match the actual label.
o In the table above, errors occurred in email samples 2, 3, 6, 9, and 10.
o Total number of errors = 5
3. Calculate the error rate: The error rate is simply the number of errors divided by the total number of
observations (emails):
Error rate = 5 / 10 = 0.5, i.e., 50%
Step-by-Step Example:
1. Prediction Probabilities: Suppose our classifier makes the following predictions for four emails:
2. Calculate Fractional Counts: Now, we compute the fractional contributions for both the
spam and not spam classes.
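The difference between simple and fractional counting can be sketched as follows. The four probabilities are hypothetical placeholders chosen only for illustration, and the sketch assumes that each email contributes its predicted spam probability p to the spam count and 1 - p to the not-spam count.

```python
# Hypothetical spam probabilities for four emails (illustrative values only)
spam_probs = [0.9, 0.6, 0.8, 0.2]

# Simple counting: each email counts fully toward its thresholded class
simple_spam = sum(1 for p in spam_probs if p >= 0.5)
simple_not_spam = len(spam_probs) - simple_spam

# Fractional counting: each email contributes p to spam and (1 - p) to not spam
frac_spam = sum(spam_probs)
frac_not_spam = sum(1 - p for p in spam_probs)

print(simple_spam, simple_not_spam)
print(round(frac_spam, 2), round(frac_not_spam, 2))
```

Fractional counts keep the classifier's uncertainty: an email with probability 0.6 adds less to the spam total than one with probability 0.9, whereas simple counting treats both the same.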
Example: Consider a hypothetical medical test to detect a certain disease. There are 100 patients, 40 of
whom have the disease. We will use this example to build the ROC curve of a realistic (non-ideal) model.
Suppose our classification model performed as follows:
Calculation of TPR and FPR is carried out as:
TPR (True Positive Rate) = TP / (TP + FN)
FPR (False Positive Rate) = FP / (FP + TN)
Problem Setup
We have a binary classification problem, where the goal is to predict whether an email is spam or not spam.
The model predicts a probability for each email, and we set a threshold (e.g., 0.5) to classify the email as
spam (1) or not spam (0).
Let’s assume we have the following true labels (the actual class) and the predicted labels for 5 emails:
Email    True Label (Actual)    Predicted Probability (Spam)    Predicted Label (Spam: 1, Not Spam: 0)
E1       Spam (1)               0.9                             1
E2       Not Spam (0)           0.6                             1
E3       Spam (1)               0.8                             1
E4       Not Spam (0)           0.2                             0
E5       Not Spam (0)           0.3                             0
Step 1: Thresholding
We set the threshold to 0.5. That means:
• If the predicted probability for spam is greater than or equal to 0.5, classify as spam (1).
• If the predicted probability for spam is less than 0.5, classify as not spam (0).
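The thresholding rule and the resulting confusion counts for the five emails in the table can be sketched as below. The variable names are illustrative; the labels and probabilities are taken from the table.

```python
THRESHOLD = 0.5

# (email id, true label, predicted spam probability) from the table above
emails = [("E1", 1, 0.9), ("E2", 0, 0.6), ("E3", 1, 0.8), ("E4", 0, 0.2), ("E5", 0, 0.3)]

counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for _, actual, prob in emails:
    predicted = 1 if prob >= THRESHOLD else 0   # spam if probability >= threshold
    if predicted == 1:
        counts["TP" if actual == 1 else "FP"] += 1
    else:
        counts["TN" if actual == 0 else "FN"] += 1

# Errors are the off-diagonal cells of the confusion matrix
error_rate = (counts["FP"] + counts["FN"]) / len(emails)
print(counts, error_rate)
```

Only E2 is misclassified (a false positive), so one of the five emails is in error.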
2.12 Estimating the Composition of Populations
Estimating the composition of populations often refers to determining the proportions of different subgroups
within a population based on sample data. This is commonly done in fields like biology, social science, or
market research.
Example Problem:
Suppose you are studying the population of a town with 10,000 residents. You want to estimate the proportion
of people in the town who own pets (dogs or cats). A simple random sample of 200 people is taken, and out
of the 200 people surveyed, 80 report owning pets.
You want to estimate the proportion of the entire population that owns pets.
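Formula: p̂ = x / n, where x is the number of people in the sample who own pets and n is the sample size.
Calculation: p̂ = 80 / 200 = 0.4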
This means the estimated proportion of people who own pets in the sample is 0.4, or 40%.
Estimating Population Composition:
Since you are dealing with a sample, this proportion is used to estimate the proportion in the entire population.
So, the estimated proportion of people in the entire town who own pets is 0.4 (or 40%).
Estimating the Total Number of Pet Owners:
Now, if you want to estimate the total number of people in the entire population who own pets, you can
multiply the estimated proportion by the total population size (N):
Estimated number of pet owners = N × (estimated proportion) = 10,000 × 0.4 = 4,000
Thus, you estimate that about 4,000 people in the town own pets.
Summary:
• Estimated proportion of pet owners in the sample: 0.4 or 40%
• Estimated total number of pet owners in the population: 4,000
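The same calculation can be sketched as a small Python function. The function name and argument names are illustrative, and the sketch assumes a simple random sample, as in the example.

```python
def estimate_composition(successes, sample_size, population_size):
    """Estimate a subgroup's proportion and its total size in the population."""
    proportion = successes / sample_size
    # Multiply before dividing so the total is not affected by float rounding
    estimated_total = round(population_size * successes / sample_size)
    return proportion, estimated_total

# 80 pet owners in a sample of 200, town population of 10,000
proportion, total = estimate_composition(80, 200, 10_000)
print(proportion, total)
```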
Questions
4 Marks Questions
1) Explain the concept of Bayesian classifiers.
2) Discuss the role of Bayes' Theorem in Bayesian classification.
3) What is a decision boundary in the context of classification?
4) Given two-dimensional feature data from two classes, describe how you would visualize the decision
boundary.
5) Derive the equation for a decision boundary in a D-dimensional feature space using matrix notation.
6) Describe the process of estimating error rates in classification models.
7) Explain the impact of unequal costs of error (misclassification).
8) What are model-based estimates in the context of classification?
9) Explain the concept of simple counting in the context of classification.
10) Define fractional counting in the context of Bayesian classifiers.
11) What are characteristic curves, and how are they used to evaluate the performance of classification
models?
12) What is a confusion matrix, and how can it be used to assess the performance of a classification
model?
13) How can the composition of populations be estimated in a classification setting?
8 Marks Questions
1) Explain the working of a Bayesian classifier in detail. Describe the role of Bayes' Theorem in the
classification process.
2) Describe in detail how decision boundaries are formed in classification problems.
3) Consider a classification problem with two-dimensional data points belonging to two classes.
Explain how to determine and visualize the decision boundary between the two classes.
4) Derive the equation of a decision boundary in a D-dimensional feature space using matrix notation.
5) Discuss the importance of error rate estimation in classification tasks.
6) Explain the concept of unequal costs of error in classification problems.
Practice Problems with Solutions
Problem on Model-Based Estimation:
a) Consider a scenario in which we want to predict whether a customer will buy a product based on their
age. We are given the following training data:
Customer    Age    Purchased (Class)
1           25     Yes
2           35     No
3           45     Yes
4           30     Yes
5           40     No
We want to classify a new customer whose age is 33.
We will use Model-Based Estimation (Bayesian classification) to classify the new customer. Assume the
feature (age) follows a normal distribution in each class, and we'll estimate the likelihoods using the sample
mean and standard deviation.
Solution:
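Step 1: Calculate the Prior Probabilities
The prior probabilities for "Yes" and "No" classes are simply the frequencies of each class in the training data.
Three of the five customers purchased and two did not, so P(Yes) = 3/5 = 0.6 and P(No) = 2/5 = 0.4.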
Step 5: Conclusion
Since 𝑃(𝑌𝑒𝑠 ∣ 33) = 0.027 is greater than 𝑃(𝑁𝑜 ∣ 33) = 0.0216, the model predicts that the customer will
purchase the product.
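A compact way to check this practice problem is with Python's standard-library NormalDist class. This is a sketch under the problem's stated assumption of a normal distribution per class; NormalDist.from_samples fits the mean and the sample standard deviation, so the scores it produces need not equal the values quoted in Step 5, although the predicted class agrees. The helper name score is illustrative.

```python
from statistics import NormalDist

# Ages grouped by class, from the training table above
ages = {"Yes": [25, 45, 30], "No": [35, 40]}
n_total = sum(len(values) for values in ages.values())

def score(label, x):
    """Prior times Gaussian likelihood for one class (an unnormalized posterior)."""
    prior = len(ages[label]) / n_total
    model = NormalDist.from_samples(ages[label])   # mean and sample std of the class
    return prior * model.pdf(x)

scores = {label: score(label, 33) for label in ages}
prediction = max(scores, key=scores.get)
print(prediction, scores)
```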
d) Problem on the ROC Characteristic Curve for Error Rate Estimation
Problem Statement: A medical test is conducted to detect a disease. The test produces the following results:
• True Positives (TP): 80 (patients correctly identified as having the disease)
• False Positives (FP): 20 (healthy patients incorrectly identified as having the disease)
• True Negatives (TN): 70 (healthy patients correctly identified as healthy)
• False Negatives (FN): 30 (patients with the disease incorrectly identified as healthy)
We are asked to calculate the True Positive Rate (Sensitivity) and False Positive Rate (1 - Specificity) and
plot a basic ROC curve.
Solution: ROC Curve for the Medical Test
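The two rates follow directly from the four counts given in the problem statement. The sketch below computes them and lists the points of a basic ROC curve; because only one operating point is available, the curve is assumed to be the piecewise-linear path from (0, 0) through that point to (1, 1).

```python
# Counts from the problem statement
tp, fp, tn, fn = 80, 20, 70, 30

tpr = tp / (tp + fn)   # True Positive Rate (sensitivity)
fpr = fp / (fp + tn)   # False Positive Rate (1 - specificity)

# A single operating point gives a three-point ROC sketch: (FPR, TPR) pairs
roc_points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]

print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")
print(roc_points)
```

Here TPR = 80/110 ≈ 0.727 and FPR = 20/90 ≈ 0.222, so the operating point lies above the diagonal of a random classifier.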
So, the confusion matrix will look like this:
               Predicted Spam    Predicted Ham
Actual Spam    200               50
Actual Ham     50                700
Step 2: Calculate Evaluation Metrics
1. Accuracy: The accuracy of the classifier is the proportion of correct predictions (both spam and ham)
out of all predictions:
Accuracy = (TP + TN) / Total = (200 + 700) / 1000 = 90%
2. Precision: Precision is the proportion of correctly predicted spam emails out of all the emails predicted
as spam:
Precision = TP / (TP + FP) = 200 / (200 + 50) = 80%
3. Recall (Sensitivity): Recall is the proportion of correctly predicted spam emails out of all the actual
spam emails:
Recall = TP / (TP + FN) = 200 / (200 + 50) = 80%
4. F1-Score: The F1-score is the harmonic mean of precision and recall:
F1-Score = 2 × Precision × Recall / (Precision + Recall) = (2 × 0.8 × 0.8) / (0.8 + 0.8) = 80%
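The four metrics can be verified from the confusion matrix with a few lines of Python. The variable names are illustrative, and spam is treated as the positive class.

```python
# Confusion matrix entries, with spam as the positive class
tp, fn = 200, 50    # actual spam: predicted spam / predicted ham
fp, tn = 50, 700    # actual ham:  predicted spam / predicted ham

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```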
f) Problem on Estimating the Composition of Populations
Problem Definition: A researcher wants to estimate the proportion of male and female students in a large
university. Since it is not feasible to survey all students, the researcher randomly selects a sample of 200
students and records the following:
• 120 students are male.
• 80 students are female.
Using the sample, estimate the proportion of male and female students in the entire university.
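Solution: The sample proportion can be used directly as an estimate of the population proportion:
Estimated proportion of male students = 120 / 200 = 0.6 (60%)
Estimated proportion of female students = 80 / 200 = 0.4 (40%)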