Lecture 03: Bayes Classifier with Probability Concepts
Machine Learning
1
Definition
machine learning: improving some performance measure with experience
computed from the data
2
Recap
• Supervised learning
• Unsupervised learning
• Reinforcement learning
3
4
https://www.frontiersin.org/articles/10.3389/fphar.2021.720694/full
Today:
• Learning to answer Yes/No → CLASSIFICATION Problem
• Bayes Classifier
• K-nearest neighbours
5
Examples
• Spam identification
6
Classification problems
• The response Y to a feature X (either a scalar or a vector) is qualitative
7
Bayes Classification with Domain Knowledge
A basic ML problem (essentially probability estimation in this case)
• The salmon available in a given market are 70% freshwater and 30%
seawater fish. You decide to model their lengths using different normal
distributions.
• The local fish expert, Mr. Know-It-All, tells you that the lengths of freshwater
salmon are normally distributed with a mean of 27 inches and standard
deviation of 1 inch, whereas the lengths of seawater salmon are normally
distributed with a mean of 30 inches and standard deviation of 1 inch.
• You randomly select a salmon from the market and find it to be 29 inches
long. What is the probability that the selected salmon is a freshwater fish?
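A quick numerical check of this question, as a sketch using scipy.stats.norm with the priors and Gaussians stated above:

from scipy.stats import norm

# Priors: 70% freshwater, 30% seawater
prior_fresh, prior_sea = 0.7, 0.3

# Class-conditional likelihoods of a 29-inch salmon
lik_fresh = norm.pdf(29, loc=27, scale=1)   # N(27, 1)
lik_sea = norm.pdf(29, loc=30, scale=1)     # N(30, 1)

# Bayes' rule: P(freshwater | length = 29)
posterior_fresh = prior_fresh * lik_fresh / (prior_fresh * lik_fresh + prior_sea * lik_sea)
print(posterior_fresh)   # roughly 0.34 with these numbers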
8
Bayes Rule
• A university institutes an infectious disease testing program in which students are tested weekly with a
rapid test and those who test positive for the disease are quarantined immediately. Unfortunately, tests
are not completely accurate and the results are accurate 98% of the time whether or not the student has
the disease. The university is very successful in its testing program, as a result of which only 1% of the
students taking a test on any given day actually have the disease (whether or not they test positive).
What is the probability that a quarantined student has the disease?
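A worked calculation for this question in plain Python, assuming 98% accuracy for both infected and healthy students and the stated 1% prevalence:

# Prior probability that a tested student has the disease
p_disease = 0.01
# P(positive | disease) = P(negative | no disease) = 0.98
p_pos_given_disease = 0.98
p_pos_given_healthy = 1 - 0.98   # false-positive rate

# Bayes' rule: P(disease | positive test), i.e. that a quarantined student is truly infected
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
print(p_pos_given_disease * p_disease / p_pos)   # about 0.33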
9
Model via Normal Distribution
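For reference, the normal density presumably used here to model the class-conditional length distributions (the standard formula, with mean $\mu$ and standard deviation $\sigma$):

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$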
11
Performance of classification
• Performance of the model C is measured using misclassification error rate:
• i.e. the fraction (average number) of misclassifications on the labelled data given to us
• Confusion matrix
• Error metrics
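A minimal sketch of these quantities with scikit-learn, using illustrative labels (not from the lecture); the error rate is $\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Misclassification error rate: fraction of wrong predictions
error_rate = np.mean(y_true != y_pred)
print(error_rate, 1 - accuracy_score(y_true, y_pred))   # the two agree

# Confusion matrix: rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))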
12
LDA: Linear Discriminant Analysis
• Also known as Normal discriminant analysis (NDA)
• LDA algorithms model the data distribution for each class and use Bayes'
theorem
• Usually very quick, and works well for problems with a small number of features
13
• Bayes' theorem gives conditional probabilities: the probability of an event given that some other event has occurred
• LDA classifies by projecting data with two or more dimensions onto one dimension, where the classes can be separated more easily.
• The technique is, therefore, sometimes referred to as dimensionality reduction
14
https://www.ibm.com/topics/linear-discriminant-analysis
Linear Discriminant Analysis
Bayes’ theorem:
• For an observation x*, estimate the value of the discriminant 𝛿 for each class
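A hedged reconstruction of the formulas this slide relies on, in the ISLP notation ($\pi_k$ is the class prior and $f_k$ the class-conditional density); for a single feature with class means $\mu_k$ and shared variance $\sigma^2$, the discriminant is linear in x:

$$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}, \qquad \delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k$$

Assign x* to the class with the largest $\delta_k(x^*)$.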
16
LDA vs Bayes (refer to whiteboard notes for examples and equations)
(comparison table: LDA vs Bayes, e.g. under uncorrelated features)
19
Introduction to Statistical Learning with Python, 2023
sklearn implementation - LDA
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy labelled data: two classes in 2D
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

# Fit the LDA classifier and predict the class of a new observation
clf = LinearDiscriminantAnalysis()
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))

Detailed example:
https://scikit-learn.org/stable/auto_examples/classification/plot_lda.html#sphx-glr-auto-examples-classification-plot-lda-py
20
sklearn implementation - KNN
from sklearn.neighbors import NearestNeighbors

# Three sample points in 3D
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]

# Find the single nearest neighbour of a query point
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(samples)
print(neigh.kneighbors([[1., 1., 1.]]))   # (distance to, index of) the nearest sample
21
Confusion Matrix
• Possible outcomes when a classifier model is applied to labelled data
22
Introduction to Statistical Learning with Python, 2023
Performance metric: ROC curve
• Receiver Operating Characteristic
• (from communications theory)
23
Introduction to Statistical Learning with Python, 2023
Naïve Bayes
• This method presents another approach to estimating $f_k$ (different from LDA) → 'naïve' because it relies on a naïve approximation of $f_k$.
• Instead of assuming that these functions belong to a particular family of distributions (e.g.
multivariate normal), we assume that:
• “Within the kth class, the d features are independent”
• $f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kd}(x_d)$
• This is a fairly big assumption – it ignores the joint distribution between the features
• i.e. the off-diagonal elements of the covariance matrix are dropped
• The posterior probability is therefore evaluated as shown after this list (note: ISLP uses the subscript p instead of the usual d for the number of features)
• Each f can be obtained through any probability distribution function that relates to the data
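The posterior referred to above, in the ISLP notation (with p features, written d elsewhere in these slides):

$$\Pr(Y = k \mid X = x) = \frac{\pi_k \, f_{k1}(x_1) f_{k2}(x_2) \cdots f_{kp}(x_p)}{\sum_{l=1}^{K} \pi_l \, f_{l1}(x_1) f_{l2}(x_2) \cdots f_{lp}(x_p)}$$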
24
Estimating f ’s
25
Example
• 2 classes, 3 features out of which 2 are quantitative and 1 qualitative
• Let’s consider 𝜋1 = 𝜋2 = 0.5 (equal no. of responses in 2 classes)
• Density functions $f_{kj}$ for k = 1, 2 and j = 1, 2, 3 are displayed below.
27
Comparison
28
Introduction to Statistical Learning with Python, 2023
K Nearest Neighbours
Binary (2-class) problem.
Method: take K = 3; draw a green circle around the test point x to capture the 3 nearest points. If the number fraction of points of one class is greater than the other's, assign that class to x.
The resulting KNN decision boundary is surprisingly close to the Bayes classifier!
29
Introduction to Statistical Learning with Python, 2023
K-nearest neighbours
• Labelled data: (x1, x2, k)
• k ∈ {0, 1} for binary and
• k ∈ {1, 2, 3, …, m} for multi-class
• Consider any point (x1, x2)
from the data set
30
Decision boundary
• The line (or curve) that separates the two label regions
31
K = 10
Algorithm:
➢ Define K
➢ Obtain the response labels of the K nearest neighbours of an observation x*
➢ Evaluate the number fractions of the classes among these neighbours
➢ Assign the class j whose number fraction of responses in the neighbourhood N₀ is the largest (i.e. Pr > 0.5 in the binary case, from the equation below)
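The equation referred to above (the standard KNN estimate, as in ISLP), where N₀ is the set of the K nearest neighbours of x*:

$$\Pr(Y = j \mid X = x^*) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$$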
32
Overfit and underfit models
34
sklearn implementation - KNN
from sklearn.neighbors import NearestNeighbors

# Three sample points in 3D
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]

# Find the single nearest neighbour of a query point
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(samples)
print(neigh.kneighbors([[1., 1., 1.]]))   # (distance to, index of) the nearest sample
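NearestNeighbors above only returns distances and indices; for classification itself, a minimal sketch with KNeighborsClassifier on toy data (not from the lecture):

from sklearn.neighbors import KNeighborsClassifier

# Toy labelled data: two classes in 2D
X = [[0., 0.], [0., 0.5], [1., 1.], [1.2, 0.9]]
y = [0, 0, 1, 1]

# The K = 3 nearest neighbours vote on the class of a new point
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
print(clf.predict([[0.9, 1.0]]))        # predicted class label
print(clf.predict_proba([[0.9, 1.0]]))  # number fraction of each class among the K neighbours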
35
KNN
• A simple and intuitive way to classify
• Flexible
• (read more in the ISLP book)
36
Homework on Application
• Use the iris flower data and develop the following classifier models
37
Perceptron Learning Algorithm
• The methods so far (Bayes, LDA, KNN) assign a data point to a class based on a set of rules
• What do we mean by learning: as more data points arrive, one can establish an improved performance measure.
38
Example
• Learning the logical AND gate:

X1  X2  Y
0   0   0
0   1   0
1   0   0
1   1   1

• Credit card approval (example applicant):

age: 23 years
gender: female
annual salary: INR 10,00,000
years in residence: 1 year
years in job: 0.5 year
current debt: 2,00,000
39
‘Perceptron’
• Inputs: $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_d)$
• For AND gate: d=2
• Credit card: d=5
40
Vector form
• $h(\mathbf{x}) = \tfrac{1}{2}\left(\mathrm{sign}(\mathbf{w}^T\mathbf{x}) + 1\right)$
• $h(\mathbf{x}) = 1$ if $\mathrm{sign}(\mathbf{w}^T\mathbf{x}) = +1$
• $h(\mathbf{x}) = 0$ if $\mathrm{sign}(\mathbf{w}^T\mathbf{x}) = -1$
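A direct transcription of this hypothesis as a sketch in NumPy (w and x are illustrative values; x[0] = 1 carries the bias/threshold weight w0):

import numpy as np

def h(w, x):
    # Perceptron hypothesis: 1 if w.T x > 0, 0 if w.T x < 0 (an exact tie gives 0.5)
    return 0.5 * (np.sign(w @ x) + 1)

w = np.array([-0.5, 1.0, 2.0])    # weights, including the bias w0 = -0.5
x = np.array([1.0, 0.3, 0.4])     # input, with the constant coordinate x0 = 1
print(h(w, x))                    # 1.0, since w.T x = -0.5 + 0.3 + 0.8 = 0.6 > 0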
41
Inner product
• Each 'tall' (column) vector w represents a hypothesis h and is multiplied with the 'tall' vector x
• Stick to 'tall' versions of vectors to simplify notation
• Recall: $\mathbf{w}^T\mathbf{x} = \mathbf{w}\cdot\mathbf{x} = \sum_{i=0}^{d} w_i x_i$
42
How do perceptrons look like?
• In $\mathbb{R}^1$: $h(x) = \mathrm{sign}(w_0 + w_1 x) = \mathrm{sign}\!\left([w_0\ \ w_1]\cdot[1\ \ x]^T\right)$
(figure: points on a number line, with the threshold at $x = -w_0/w_1$; a point on the wrong side of the threshold is a misclassification)
43
How do perceptrons look like?
• In $\mathbb{R}^2$: $h(\mathbf{x}) = \mathrm{sign}(w_0 + w_1 x_1 + w_2 x_2)$
(figure: two candidate separating lines, Hypothesis 1 and Hypothesis 2; Hypothesis 2 happens to classify the points correctly)
44
Question
Task: identify spam SMS/email.
• Assume each email in your sso-webmail is represented by the frequency of keyword occurrences. An output of +1 → spam. Which of the following sets of features should have large positive weights in a perceptron model?
1. Coffee, Tea, Samosa, Dosa
2. Free, deal, save, discount
3. Machine, statistics, data, book
4. Moodi, techfest, e-summit, freshman
45
How do we arrive at the correct hypothesis g
• ℋ = all possible perceptrons, 𝑔 =?
• Need: 𝑔 ≈ 𝑓
• Ideally: 𝑔 𝐱𝑗 = 𝑓 𝐱𝑗 = 𝑦𝑗
• Difficulty: ℋ is very large (all possible lines in 2D plane)
• Start with a guess $g_0$, and 'correct' or improve it with the data points
• $g_0$ is represented by the weight vector $\mathbf{w}_0 = [w_0, w_1, w_2]^T$ evaluated at t = 0
46
Generating the input data
• Fetch ideal, linearly separable (classified) data from somewhere
• (most real data are noisy → no clear separation possible)
47
Algorithm
• Generate a set of points (x’s) in x-y plane (D)
• Pick a random line ax + by + c = 0, and assign labels (y's) using the sign function
• This line plays the role of the ideal separator f
• Start from some $\mathbf{w}_0$ (say, $\mathbf{0}$), and 'correct' its mistakes on D
• Run the following step repeatedly: find a data point the current weights vector misclassifies, i.e. $\mathrm{sign}(\mathbf{w}_t^T\mathbf{x}_{n(t)}) \neq y_{n(t)}$, and correct the weights using that one data point's input vector:
$\mathbf{w}_{t+1} = \mathbf{w}_t + y_{n(t)}\,\mathbf{x}_{n(t)}$
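A compact sketch of this procedure: synthetic 2D data labelled by a chosen line, then PLA corrections until no mistakes remain (the data and the separator here are illustrative, not the lecture's):

import numpy as np

rng = np.random.default_rng(0)

# Generate points in the x-y plane and label them with a chosen line f
X = rng.uniform(-1, 1, size=(50, 2))
X = np.hstack([np.ones((50, 1)), X])     # prepend x0 = 1 for the bias weight w0
w_f = np.array([1.0, -1.0, 2.0])         # the ideal separator f
y = np.sign(X @ w_f)                     # labels from the sign function

# PLA: start from w = 0 and correct mistakes on D
w = np.zeros(3)
t = 0
while True:
    mistakes = np.where(np.sign(X @ w) != y)[0]
    if len(mistakes) == 0:               # stop when no misclassification remains
        break
    n = mistakes[0]                      # pick a misclassified point (naive cycle)
    w = w + y[n] * X[n]                  # update: w_{t+1} = w_t + y_n * x_n
    t += 1

print("iterations:", t, "final weights:", w)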
48
https://www.csie.ntu.edu.tw/~htlin/mooc/
More
Situation/Label 'YES' ($y_n = +1$) but $\mathbf{w}_t^T\mathbf{x}_n < 0$: add the input vector to the weights vector, rotating $\mathbf{w}$ toward $\mathbf{x}_n$
Situation/Label 'NO' ($y_n = -1$) but $\mathbf{w}_t^T\mathbf{x}_n > 0$: subtract the input vector, rotating $\mathbf{w}$ away from $\mathbf{x}_n$
Geometrically, $\mathbf{w}_{t+1}$ turns toward the correct side of the misclassified point
49
• For a given point with coordinates (x_a, x_b), check whether x_b lies above or below the line (this determines its label)
50
When do we stop?
• When no mistake (misclassification) remains in the given data set
• Naïve cycle: visit the points in a fixed order
• Random cycle: visit the points in a (precomputed) random order
51
Execution
Initial: 8
# misclass. pts 5, iter 0, loc 2
# misclass. pts 3, iter 1, loc 1
# misclass. pts 3, iter 2, loc 6
# misclass. pts 0, iter 3, loc 3
total iterations = 3
54
Sequences
57
Summary
• Structure of Perceptron Learning Hypothesis
• Algorithm
• 1D and 2D examples
• Homework on 1D
58
Homework
• Write a program to classify a given data set
• Understand the PLA
• Write a 1D ($\mathbb{R}^1$) code if you are a beginner to Python
• Start with synthetic data (a few points labelled +1 and the rest −1)
• You need two arrays: one for the x points and the other for the y labels
• Start with a guess for $\mathbf{w}_0$ – perhaps 0 or any random value
• Update the weights
• Calculate the number of misclassified points at every iteration and print it
59
Question
• Let us think about why PLA works at all. Let n = n(t) be the index of the point picked at iteration t according to the rule of PLA, i.e. a point the current weights misclassify (see below).
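A hedged reconstruction of the step this question points to (following the cited NTU MOOC): the update uses a misclassified point $(\mathbf{x}_{n(t)}, y_{n(t)})$, so

$$\mathbf{w}_{t+1} = \mathbf{w}_t + y_{n(t)}\,\mathbf{x}_{n(t)}, \qquad \mathbf{w}_f^T \mathbf{w}_{t+1} = \mathbf{w}_f^T \mathbf{w}_t + y_{n(t)}\,\mathbf{w}_f^T\mathbf{x}_{n(t)} \ge \mathbf{w}_f^T \mathbf{w}_t + \min_n y_n \mathbf{w}_f^T \mathbf{x}_n > \mathbf{w}_f^T \mathbf{w}_t$$

i.e. every correction strictly increases the inner product of $\mathbf{w}_t$ with the ideal $\mathbf{w}_f$.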
61
Question
• Given a linearly separable D, does PLA always stop correctly?
62
PLA aligns with the unknown, ideal classifier
• For a linearly separable D, there exists a perfect $\mathbf{w}_f$ such that $y_n = \mathrm{sign}(\mathbf{w}_f^T\mathbf{x}_n)$ for all n
64
https://www.csie.ntu.edu.tw/~htlin/mooc/
Upper bound for T
(Eq. 1)  $\dfrac{\mathbf{w}_f^T \mathbf{w}_T}{\lVert\mathbf{w}_f\rVert\,\lVert\mathbf{w}_T\rVert} \ge \sqrt{T}\cdot\mathrm{constant}$
• Define $R^2 = \max_n \lVert\mathbf{x}_n\rVert^2$ and $\rho = \min_n y_n \dfrac{\mathbf{w}_f^T}{\lVert\mathbf{w}_f\rVert}\mathbf{x}_n$
• Express the upper bound on T in terms of these two measures:
$T \le \dfrac{R^2}{\rho^2}$
• The maximum value of the LHS in Eq. 1 is 1. Since T corrections increase the normalized inner product to at least $\sqrt{T}$ times the constant, and the constant (which works out to $\rho/R$) is at most 1, the maximum number of corrected mistakes is $1/\mathrm{constant}^2$.
• Check this T with the model you simulate
65
PLA - Summary
• Guaranteed classification if D is linearly separable
• Weights updated based on misclassification
• Inner product of wf and wt grows fast
• The length of $\mathbf{w}_t$ grows slowly
• Stopping criterion → alignment with $\mathbf{w}_f$
• $\rho = \min_n y_n \dfrac{\mathbf{w}_f^T}{\lVert\mathbf{w}_f\rVert}\mathbf{x}_n$ depends on the unknown $\mathbf{w}_f$, hence the number of steps is not guaranteed in advance.
66
Noisy data
(figure: the learning setup with noise added to the labels of the target f)
68
https://www.csie.ntu.edu.tw/~htlin/mooc/
Line with noise tolerance
• 'Little' noise → only a small fraction of the points cannot be linearly separated
• Finding the line that makes the fewest mistakes on such data is NP-hard to solve
69
Summary
• PLA → hyperplanes as linear classifiers in $\mathbb{R}^d$
70