
Lecture 03 Bayes Classifier With Prob Concepts

The document outlines the ME228 course on Applied Data Science and Machine Learning, covering key concepts such as supervised, unsupervised, and reinforcement learning. It discusses classification problems, various classifiers including Bayes Classifier, K-nearest neighbors, and Linear Discriminant Analysis, along with their applications and performance metrics. Additionally, it includes practical implementations using Python's sklearn library and homework assignments for students.


ME228: Applied Data Science and

Machine Learning

Instructors: Neeraj Kumbhakarna and Shyam Karagadde


Department of Mechanical Engineering
Email: neeraj_k@iitb.ac.in , s.Karagadde@iitb.ac.in

1
Definition
machine learning: improving some performance measure with experience
computed from the data

data → ML→ improved performance measure

2
Recap

• Supervised learning

• Unsupervised learning

• Reinforcement learning

3
4
https://www.frontiersin.org/articles/10.3389/fphar.2021.720694/full
Today:
• Learning to answer Yes/No → CLASSIFICATION Problem

• Bayes Classifier

• K-nearest neighbours

• Perceptron model for linearly separable data


• Hypothesis Set
• Learning Algorithm

• Linearly Non-separable input data


• ‘Little’ noise → Pocket Algorithm
• Strong non-linearity

5
Examples

• Facial or Fingerprint recognition system to allow entry (yes/no)

• Spam identification

• Soundness of a material / engineered component based on one or more input


parameters

• Determining handwritten numbers / characters (multi-class yes/no)

6
Classification problems
• The response Y to a feature X (either a scalar or a vector) is qualitative

• A quantitative response has a real-valued target associated with a feature – such as the marks obtained by everyone here in a course
• Qualitative measure: grade assigned: A, B, C (multi-class), or Yes-No

• Goals of a classification problem:

• Build a classifier C(X) that assigns a class label from the set C to a future unlabeled observation X.
• Assess the uncertainty in each classification
• Understand the role of each feature X1, …, Xd.

7
Bayes Classification with Domain Knowledge
A basic ML problem (essentially probability estimation in this case)

• The salmon available in a given market are 70% freshwater and 30%
seawater fish. You decide to model their lengths using different normal
distributions.
• The local fish expert, Mr. Know-It-All, tells you that the lengths of freshwater
salmon are normally distributed with a mean of 27 inches and standard
deviation of 1 inch, whereas the lengths of seawater salmon are normally
distributed with a mean of 30 inches and standard deviation of 1 inch.
• You randomly select a salmon from the market and find it to be 29 inches
long. What is the probability that the selected salmon is a freshwater fish?
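
A minimal numerical check of this posterior in Python, assuming the stated priors (0.7 freshwater, 0.3 seawater) and the normal length models above (the variable names are illustrative only):

from scipy.stats import norm

# Class-conditional densities of the observed length x = 29 inches
p_len_fresh = norm.pdf(29, loc=27, scale=1)   # freshwater lengths ~ N(27, 1)
p_len_sea   = norm.pdf(29, loc=30, scale=1)   # seawater lengths  ~ N(30, 1)

# Bayes' rule: posterior = prior x likelihood / evidence
posterior_fresh = 0.7 * p_len_fresh / (0.7 * p_len_fresh + 0.3 * p_len_sea)
print(posterior_fresh)   # ~0.34, so the 29-inch fish is more likely a seawater salmon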

8
Bayes Rule

• A university institutes an infectious disease testing program in which students are tested weekly with a
rapid test and those who test positive for the disease are quarantined immediately. Unfortunately, tests
are not completely accurate and the results are accurate 98% of the time whether or not the student has
the disease. The university is very successful in its testing program, as a result of which only 1% of the
students taking a test on any given day actually have the disease (whether or not they test positive).
What is the probability that a quarantined student has the disease?
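
A quick check with Bayes' rule (writing D for 'has the disease' and + for 'tests positive'):

P(D | +) = P(+ | D) P(D) / [ P(+ | D) P(D) + P(+ | not D) P(not D) ]
         = (0.98 × 0.01) / (0.98 × 0.01 + 0.02 × 0.99)
         = 0.0098 / 0.0296 ≈ 0.33

so only about one third of the quarantined students actually have the disease.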

9
Model via Normal Distribution

C. A. Aggarwal, Prob. & Stat. for ML 10


Calculation methodology

• Prior probabilities (via distributions) known/given
• In general, this is like having a trained ML model (here the model is the probability distribution)
• Take a test data set (x, y), where x is the feature and y is the label.
• For each data point, estimate the posterior probability using Bayes' rule, and assign the most likely class (the one with p ≥ 0.5 in the two-class case)
• Report the number of misclassifications or the confusion matrix
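
A sketch of this methodology in Python, reusing the salmon model from earlier as an assumed 'trained' model (the test set here is synthetic and purely illustrative):

import numpy as np
from scipy.stats import norm
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# Synthetic labelled test set drawn from the assumed model: 0 = freshwater, 1 = seawater
n = 1000
y = (rng.random(n) < 0.3).astype(int)
x = np.where(y == 0, rng.normal(27, 1, n), rng.normal(30, 1, n))

# Posterior P(freshwater | length) via Bayes' rule
p_fresh = 0.7 * norm.pdf(x, 27, 1)
p_sea   = 0.3 * norm.pdf(x, 30, 1)
post_fresh = p_fresh / (p_fresh + p_sea)

# Assign the most likely class (freshwater when its posterior is >= 0.5) and report the errors
y_pred = (post_fresh < 0.5).astype(int)
print(confusion_matrix(y, y_pred))                    # rows: true class, columns: predicted class
print("misclassification rate:", np.mean(y_pred != y))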

11
Performance of classification
• Performance of the model C is measured using the misclassification error rate (see the formula after this list):

• i.e. the average number of misclassifications on the labelled data given to us

• Confusion matrix

• Error metrics
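
For reference, the misclassification error rate mentioned above is the standard quantity (as in ISLP)

Err = (1/N) Σ_{i=1..N} I( yi ≠ C(xi) )

where I(·) equals 1 for a misclassified observation and 0 otherwise.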

12
LDA: Linear Discriminant Analysis
• Also known as Normal discriminant analysis (NDA)

• Used to solve multi-class classification problems

• LDA separates multiple classes with multiple features through data dimensionality reduction

• LDA algorithms model the data distribution for each class and use Bayes' theorem

• Usually very quick; works well for problems with a small number of features

13
• Bayes' rule calculates conditional probabilities: the probability of an event given that some other event has occurred

• LDA algorithms make predictions by using Bayes' rule to calculate the probability that an input belongs to a particular output class

• LDA does this by projecting data with two or more dimensions onto one dimension so that it can be more easily classified.
• The technique is, therefore, sometimes referred to as dimensionality reduction

14
https://www.ibm.com/topics/linear-discriminant-analysis
Linear Discriminant Analysis
Bayes' theorem:

• Classify an observation to the group for which the (posterior) probability is highest

• fk (the class-conditional density) is not easily known
• How can this be estimated?

• For a single-variable problem: model fk as a normal density with class mean μk and a common variance σ²

• Substitute the above in Bayes' theorem

• Taking the log and rearranging gives the discriminant function δ

• The discriminant function δ (evaluated for each class) is required to be largest for assigning x to a class

δm = δj represents the decision boundary of LDA (between classes m and j)


15
Algorithm
• Obtain the sample means μk (for each class) and the common variance σ
  • Formula in previous whiteboard

• For an observation x*, estimate the value of the discriminant δ for each class

• Assign class 'j' to the observation x* if δj = max over i = 1:K of δi
16
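
A minimal Python sketch of this algorithm for the single-variable case (the data are made up for illustration); it uses the standard 1-D LDA discriminant δk(x) = x·μk/σ² − μk²/(2σ²) + log(πk) from ISLP, which is presumably the formula the whiteboard step refers to:

import numpy as np

def lda_1d_fit(x, y):
    # Estimate per-class means, a common variance, and the priors
    classes = np.unique(y)
    mu = np.array([x[y == k].mean() for k in classes])
    sigma2 = np.mean([x[y == k].var(ddof=1) for k in classes])  # simple pooled estimate for the sketch
    pi = np.array([np.mean(y == k) for k in classes])
    return classes, mu, sigma2, pi

def lda_1d_predict(x_star, classes, mu, sigma2, pi):
    # Assign x* to the class with the largest discriminant delta_k
    delta = x_star * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(pi)
    return classes[np.argmax(delta)]

# Illustrative 1-D data with two classes
x = np.array([1.0, 1.2, 0.8, 3.1, 2.9, 3.3])
y = np.array([0, 0, 0, 1, 1, 1])
print(lda_1d_predict(2.0, *lda_1d_fit(x, y)))   # prints 0: x* = 2.0 lies on the class-0 side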
LDA vs Bayes (refer to whiteboard notes for examples and equations)

[Figure: LDA (left) vs Bayes (right). 20 observations were drawn from each of the two classes and are shown as histograms; the Bayes decision boundary is obtained from a large number of observations.]

LDA is quite close to the ideal Bayes decision boundary!
Works well for relatively few (N) input data points.
17
Introduction to Statistical Learning with Python, 2023
Multivariate LDA

[Figure: class densities with uncorrelated features (left) and correlated features (right).]
Introduction to Statistical Learning with Python, 2023
18

[Figure: Bayes classifier vs multivariate LDA decision boundary (solid line).]
19
Introduction to Statistical Learning with Python, 2023
sklearn implementation - LDA
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Six 2-D points, labelled with two classes
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

clf = LinearDiscriminantAnalysis()
clf.fit(X, y)                     # fit the LDA model to the labelled data
print(clf.predict([[-0.8, -1]]))  # predict the class of a new observation

Detailed example:
https://scikit-learn.org/stable/auto_examples/classification/plot_lda.html#sphx-glr-auto-examples-classification-plot-lda-py

20
sklearn implementation - KNN

from sklearn.neighbors import NearestNeighbors

# Three 3-D sample points
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(samples)
# Distance to and index of the nearest neighbour of the query point
print(neigh.kneighbors([[1., 1., 1.]]))

X = [[0., 1., 0.], [1., 0., 1.]]
neigh.kneighbors(X, return_distance=False)  # indices only

from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

21
Confusion Matrix
• Possible outcomes when a classifier model is applied to labelled data

• Average or rate of errors
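
For reference, with sklearn's confusion_matrix (used later in these slides) entry [i, j] counts the observations whose true class is i and whose predicted class is j, so the diagonal holds the correct classifications and the off-diagonal entries the errors.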

22
Introduction to Statistical Learning with Python, 2023
Performance metric: ROC curve
• Receiver Operating Characteristic
  • (from communications theory)

• The larger the area under the curve, the better the classification

• The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate.
• The dotted line represents the "no information" classifier; this is what we would expect if student status and credit card balance are not associated with the probability of default
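
A short sklearn sketch of producing such a curve; the synthetic data and the choice of LDA as the classifier are illustrative assumptions, not part of the slides:

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data, half held out for testing
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

clf = LinearDiscriminantAnalysis().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]        # predicted probability of the positive class

fpr, tpr, _ = roc_curve(y_test, scores)         # false/true positive rates over all thresholds
print("AUC =", roc_auc_score(y_test, scores))   # larger area under the curve -> better classifier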

23
Introduction to Statistical Learning with Python, 2023
Naïve Bayes
• This method presents another approach to estimating fk (different from LDA) → 'naïve' because it comes from a "naïve" approximation of f.
• Instead of assuming that these functions belong to a particular family of distributions (e.g. multivariate normal), we assume that:
  • "Within the kth class, the d features are independent"
  • fk(x) = fk1(x1) × fk2(x2) × ⋯ × fkd(xd)

• This is a fairly big assumption – it ignores the joint distribution between the features
  • i.e. it lets go of the off-diagonal elements of the covariance matrix

• The posterior probability is therefore evaluated as shown below (note: the book's formula uses the subscript 'p' instead of the usual 'd' for the number of features)

• Each f can be obtained through any probability distribution function that relates to the data
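
For reference, the posterior referred to above is (writing d for the number of features, as elsewhere in this lecture; the book uses p):

Pr(Y = k | X = x) = πk · fk1(x1) · fk2(x2) ⋯ fkd(xd) / Σ_{l=1..K} πl · fl1(x1) · fl2(x2) ⋯ fld(xd)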

24
Estimating f ’s

• If Xj is quantitative, we can assume that Xj | Y = k ∼ N(µjk, σ²jk)

• If Xj is quantitative, another option is to use a non-parametric estimate for fkj

• Recall marginal and joint probability distribution functions

• If Xj is qualitative, simply count the proportion of training observations of the jth feature corresponding to each class.
  • e.g. Xj ∈ {1, 2, 3}, and we have 100 observations in the kth class
  • If the jth feature takes the values 1, 2, 3 with counts 32, 55, and 13, then the estimates are f̂kj(1) = 0.32, f̂kj(2) = 0.55, f̂kj(3) = 0.13

25
Example
• 2 classes, 3 features out of which 2 are quantitative and 1 qualitative
• Let’s consider 𝜋1 = 𝜋2 = 0.5 (equal no. of responses in 2 classes)
• Density functions: fkj: for k = 1,2 and j = 1,2,3 displayed below.

Classify a new observation x* = (0.4, 1.5, 1):

f11(0.4) = 0.368,  f12(1.5) = 0.484    ← both read directly from the distributions (code/tables)
f21(0.4) = 0.030,  f22(1.5) = 0.130
f13(1) = 0.226,  f23(1) = 0.616        ← from counting the qualitative feature responses

The given observation x* has a 94.4% posterior probability of belonging to the first class
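
The 94.4% figure follows directly from the Naïve Bayes posterior with the numbers above:

Pr(Y = 1 | x*) = (0.5 × 0.368 × 0.484 × 0.226) / (0.5 × 0.368 × 0.484 × 0.226 + 0.5 × 0.030 × 0.130 × 0.616)
              ≈ 0.0201 / (0.0201 + 0.0012) ≈ 0.944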
26
Introduction to Statistical Learning with Python, 2023
Implementation https://scikit-learn.org/stable/modules/naive_bayes.html

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
# Hold out half of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))

27
Comparison

[Figure: LDA results (left) vs Naïve Bayes results (right).]

LDA has a slightly lower overall error rate

NB correctly predicts a higher fraction of the true defaulters

NB is expected to work better where the number of features 'd' is larger or the number of data points 'N' is smaller

28
Introduction to Statistical Learning with Python, 2023
K Nearest Neighbours

Binary (2-class) problem

Method: let us take K = 3; draw a (green) circle to capture the 3 points around the test point x. If the number fraction of points of one class is greater than that of the other, assign that class to x.

The resulting KNN decision boundary is surprisingly close to the Bayes classifier!

29
Introduction to Statistical Learning with Python, 2023
K-nearest neighbours
• Labelled data: (x1, x2, k)
  • k = {1, 0}: binary, and
  • k = {1, 2, 3, …, m}: multi-class
• Consider any point (x1, x2) from the data set

• Find its K nearest neighbours. Count the number of points amongst these K that belong to a given class k, and assign (x1, x2) to the class k with the largest count.

30
Decision boundary
• The line that separates the two labels

• A point on the decision boundary is expected to have an equal number of neighbouring points belonging to both classes

31
K = 10

Algorithm:

➢ Define K
➢ Obtain the response labels of the K nearest neighbours of an observation x*
➢ Evaluate the number fractions of each class
➢ Assign class 'j' for which the number fraction of class-j responses in the neighbourhood N0 is the largest (i.e. Pr > 0.5 in the two-class case, from the equation below)
32
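
For reference, the equation referred to in the last step is the usual KNN estimate (as in ISLP): Pr(Y = j | X = x*) = (1/K) Σ_{i ∈ N0} I(yi = j), where N0 is the set of the K nearest neighbours of x*. A minimal sklearn sketch of the same algorithm (the toy data are made up for illustration):

from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training data with two classes
X = [[0., 0.], [0., 1.], [1., 0.], [3., 3.], [3., 4.], [4., 3.]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3
knn.fit(X, y)
print(knn.predict([[2.5, 2.5]]))            # the class with the largest number fraction among the 3 neighbours
print(knn.predict_proba([[2.5, 2.5]]))      # the number fractions themselves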
Overfit and underfit models

[Figure: one boundary fits all training points (no misclassification); the other fits far fewer points.]


33
Error rate

34
sklearn implementation - KNN

from sklearn.neighbors import NearestNeighbors

# Three 3-D sample points
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(samples)
# Distance to and index of the nearest neighbour of the query point
print(neigh.kneighbors([[1., 1., 1.]]))

X = [[0., 1., 0.], [1., 0., 1.]]
neigh.kneighbors(X, return_distance=False)  # indices only

from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

35
KNN
• A simple and intuitive way to classify

• Flexible
  • (read more in the ISLP book)

• Can we think of a much simpler separator, such as a line?
  • Such that the two classes lie on either side of the line.

• Such a simple model is also helpful for establishing a few theoretical results

• E.g. a linear separator for classification using Perceptron Learning

36
Homework on Application
• Use the iris flower data and develop the following classifier models

• 1. Linear Discriminant Analysis


• 2. K-nearest neighbors (KNN)

• Plot the decision boundary on the 2D plot


• Print the confusion matrix on the test data
• Plot error rate vs K for KNN only

Conceptual practice problems uploaded on moodle


https://moodle.iitb.ac.in/mod/resource/view.php?id=26287&forceview=1

37
Perceptron Learning Algorithm
• The methods so far (Bayes, LDA, KNN) are more about assigning a data point to a class based on a set of rules

• A more intuitive "learning" approach is possible

• What do we mean by learning: as more data points arrive, one can establish an improved performance measure.

38
Example
• Learning logical AND gate:

X1  X2  Y
0   0   0
0   1   0
1   0   0
1   1   1

Decide 1 or 0? (Yes/No)

• Credit card approval:

age                23 years
gender             female
annual salary      INR 10,00,000
year in residence  1 year
year in job        0.5 year
current debt       2,00,000

Decide whether to award? (Yes/No)

Choosing a good hypothesis set?

39
‘Perceptron’
• Inputs: x = (x1, x2, x3, …, xd)
  • For the AND gate: d = 2
  • Credit card: d = 5

• For each case/customer, there is a label (based on history or data)
  • Y: +1 (good), −1 (bad); the linear formulae h ∈ H are:

• Compute a weighted score:
  h(x) = sign( Σ_{i=1..d} wi·xi − threshold )

• Weights: UNKNOWNS that are to be learned

40
Vector form

• Note that the same can be defined for the AND gate:

  • h(x) = 0.5·(sign(wᵀx) + 1)
  • h(x) = 1 if sign(wᵀx) = +1
  • h(x) = 0 if sign(wᵀx) = −1
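
As a tiny check in Python that one particular choice of weights (w = [−1.5, 1, 1] with x0 = 1, an illustrative assumption) realises the AND gate through this formula:

import numpy as np

w = np.array([-1.5, 1.0, 1.0])        # [w0 (acts as -threshold), w1, w2]; illustrative values
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([1.0, x1, x2])       # prepend x0 = 1
    h = 0.5 * (np.sign(w @ x) + 1)    # h(x) = 0.5*(sign(w^T x) + 1)
    print(x1, x2, int(h))             # prints the AND truth table: only (1, 1) gives 1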

41
Inner product
• Each 'tall' (column) w represents a hypothesis h and is multiplied with a 'tall' x
• Stick to 'tall' versions of vectors to simplify notation

Recall: wᵀx = w · x (inner / dot product), which is a scalar value
42
What do perceptrons look like?
• In ℝ¹: h(x) = sign(w0 + w1·x) = sign( [w0  w1] · [1  x]ᵀ )

[Figure: labelled points on the real line; the perceptron acts as a threshold, and a point on the wrong side of it is a misclassification.]
43
What do perceptrons look like?
• In ℝ²: h(x) = sign(w0 + w1·x1 + w2·x2)

• Features x: points on the plane (or in ℝᵈ)
• Labels y: +1 and −1
• Hypothesis h: all possible lines in ℝ² (hyperplanes in ℝᵈ)

• Different lines classify x differently

[Figure: Hypothesis 1 and Hypothesis 2 (the latter happens to be the correct classification).]
44
Question
Task: Identify spam SMS.
• Assume each email in your sso-webmail is represented by frequency of keyword occurrence.
Output of +1 → spam. Which of the following sets of features shall have large positive
weights for a perceptron model?
1. Coffee, Tea, Samosa, Dosa
2. Free, deal, save, discount
3. Machine, statistics, data, book
4. Moodi, techfest, e-summit, freshman

45
How do we arrive at the correct hypothesis g?
• H = all possible perceptrons; g = ?

• Need: g ≈ f
• Ideally: g(xj) = f(xj) = yj
• Difficulty: H is very large (all possible lines in the 2D plane)
• Start with a guess g0, and 'correct' or 'improvise' with data points

• g0 is represented by the weight vector w0 = [w0, w1, w2]ᵀ (at t = 0)

46
Generating the input data
• Fetch ideal, linearly classified data from somewhere
  • (most real data are noisy → no clean separation possible)

• Generate a set of points (x's) in the x-y plane (D)

• Pick a random line ax + by + c = 0, and label (y's) using the sign function

• So you have labelled data

47
Algorithm
• Generate a set of points (x's) in the x-y plane (D)
• Pick a random line ax + by + c = 0, and label (y's) using the sign function
  • This defines the ideal separator 'f'
• Start from some w0 (say, 0), and 'correct' its mistakes on D
• Run the following steps for t = 0, 1, 2, …:
  • find a data point (xn(t), yn(t)) that is misclassified by the current weights vector wt, i.e. sign(wtᵀ xn(t)) ≠ yn(t)
  • update the weights vector with the input vector of that 1 data point: wt+1 ← wt + yn(t)·xn(t)
  • repeat until no misclassified point remains

48
https://www.csie.ntu.edu.tw/~htlin/mooc/
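
A minimal PLA sketch along the lines of the algorithm above; the synthetic data and the choice of always correcting the first misclassified point are illustrative, not prescribed by the slides:

import numpy as np

rng = np.random.default_rng(1)

# Linearly separable synthetic data: N points in [-1, 1]^2, labelled by a random 'ideal' line f
N = 50
X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, (N, 2))])   # prepend x0 = 1
w_f = rng.normal(size=3)                                       # the (unknown) target weights
y = np.sign(X @ w_f)

w = np.zeros(3)                                                # start from w = 0
t = 0
while True:
    mistakes = np.where(np.sign(X @ w) != y)[0]
    if len(mistakes) == 0:
        break                                                  # halt: no misclassified point remains
    n = mistakes[0]                                            # pick a misclassified point
    w = w + y[n] * X[n]                                        # PLA update: w <- w + y_n * x_n
    t += 1

print("corrections:", t, "final weights:", w)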
More

[Figure: the PLA update in the two situations, label 'YES' (y = +1) and label 'NO' (y = −1), showing the weights vector, the input vector, and the updated wt+1. Geometrically, the inner product wᵀx represents (through its sign) the angle between the two vectors.]

49
Setting up
• Generate N random points in the x1-x2 plane: x1 ∈ [−1, 1] and x2 ∈ [−1, 1]

• Draw a line w0 + w1·x1 + w2·x2 = 0, with the weights randomly generated
  • (wᵀx = 0 → w is normal to the separating line (classifier))

• For a given point (xa, xb), check whether it is above or below the line

• Generate labels y (−1 or +1) accordingly
50
When do we stop?
• When no mistake (misclassified point) remains in the given data set.
  • Naïve cyclic
  • Random cyclic

• Learning is 'complete' when g ≈ f
  • On D, if the algorithm halted: yes.
  • What do we say when data come from outside D?
  • If it is not halting, what are the ways to end the program?

51
Execution

[Figure: initial labelled data (left); final classification obtained by PLA, with the weights of the line shown (right).]
52
2nd example

[Figure: initial labelled data (left); final classification (weights, blue line) obtained by PLA (right).]

No. misclass. pts 4 iter 0 loc 7
No. misclass. pts 4 iter 1 loc 3
No. misclass. pts 5 iter 2 loc 1
No. misclass. pts 5 iter 3 loc 2
No. misclass. pts 5 iter 4 loc 1
No. misclass. pts 5 iter 5 loc 8
No. misclass. pts 2 iter 6 loc 4
No. misclass. pts 2 iter 7 loc 5
No. misclass. pts 2 iter 8 loc 2
No. misclass. pts 2 iter 9 loc 2
No. misclass. pts 2 iter 10 loc 4
…
No. misclass. pts 1 iter 41 loc 8
No. misclass. pts 1 iter 42 loc 1
No. misclass. pts 1 iter 43 loc 2
No. misclass. pts 1 iter 44 loc 7
No. misclass. pts 1 iter 45 loc 0
No. misclass. pts 1 iter 46 loc 7
No. misclass. pts 1 iter 47 loc 0
No. misclass. pts 1 iter 48 loc 8
No. misclass. pts 1 iter 49 loc 7
No. misclass. pts 0 iter 50 loc 4
total iterations = 50
53
Sequences – 3rd example (3 iterations)

[Figure panels: Initial, Iteration 1, Iteration 3.]

# misclass. pts 5 iter 0 loc 2
# misclass. pts 3 iter 1 loc 1
# misclass. pts 3 iter 2 loc 6
# misclass. pts 0 iter 3 loc 3
total iterations = 3
54
Sequences

Note: all lines are drawn as vectors starting from the origin (0, 0) at the centre

[Figure legend: previous weight vector (normal to the classifier); update term yn·xn(t); new weight vector; the point (vector) being used for updating the weights (chosen at random every time).]
55
https://www.csie.ntu.edu.tw/~htlin/mooc/
Contd…

[Figure legend: previous weight vector (normal to the classifier); update term yn·xn(t); new weight vector; the point (vector) being used for updating the weights (chosen at random every time).]
56
https://www.csie.ntu.edu.tw/~htlin/mooc/
Question
• Let us try to think about why PLA may work at all. Let n = n(t) be chosen according to the rule of PLA, i.e., wt+1 = wt + yn(t)·xn(t).

• Which formula is true?

1. wt+1ᵀ xn = yn
2. sign(wt+1ᵀ xn) = yn
3. yn·(wt+1ᵀ xn) ≥ yn·(wtᵀ xn)
4. yn·(wt+1ᵀ xn) < yn·(wtᵀ xn)

57
Summary
• Structure of Perceptron Learning Hypothesis

• Summation of weights and inputs to decide the ‘cut-off score’

• Algorithm

• 1D and 2D examples

• Homework on 1D

58
Homework
• Write a program to classify a given data set

• Understand PLA
• Write a 1D (ℝ¹) code if you are a beginner in Python

• Start with synthetic data (a few points labelled +1 and the rest −1)
  • You need two arrays – one for the x points and the other for the y labels
• Start with a guess – perhaps 0 or any random value
• Update the weights
• Calculate the number of misclassified points at every iteration and print it

59
Question
• Let us try to think why PLA may work at all times. Let n = n(t), according to the
rule of PLA, i.e.,

• Which formula is true?


1. wTt+1xn = yn
2. sign(wTt+1xn) = yn
3. yn (wTt+1xn) >= yn wTtxn
4. yn (wTt+1xn) < yn wTtxn

Answer: 3 (yn (wTt+1xn) >= yn wTtxn)

Multiply the rule by ynxn →


Rule tries to correct the mistake (the next classifier accommodates the point xn) 60
• Show that y(t)·(wtᵀ x(t)) < 0 (this holds for the misclassified point chosen at step t)

61
Question
• Given a linear separable D, does PLA always stop correctly?

62
PLA aligns with the unknown, ideal classifier
• Linearly separable D → there exists a perfect wf such that yn = sign(wfᵀ xn)

• wf perfect → every xn is correctly away from the line: yn(t)·(wfᵀ xn(t)) ≥ min over n of yn·(wfᵀ xn) > 0

• wfᵀ wt increases by updating with any (xn(t), yn(t))

Analogy:
the angle between wf and wt+1 is smaller than that between wf and wt

Whenever it finds a misclassified point and updates, wt becomes more aligned with wf.


63
https://www.csie.ntu.edu.tw/~htlin/mooc/
Classifier evolves slowly
• Notice that the extent of the update of w(t+1) depends on how far xn is from the origin (i.e. on ‖xn‖²), and an update is made only on a mistake

This implies that ‖wt‖ grows slowly

64
https://www.csie.ntu.edu.tw/~htlin/mooc/
Upper bound for T

(Eq. 1)   wfᵀ wT / ( ‖wf‖ ‖wT‖ ) ≥ √T · constant

• Define R² = max over n of ‖xn‖² and ρ = min over n of yn · (wfᵀ / ‖wf‖) · xn

• Express the upper bound on T in terms of these two measures:

  T ≤ R² / ρ²

• The maximum value of the LHS in Eq. 1 above is 1 (it is the cosine of the angle between wf and wT). T corrections increase the normalized inner product by √T times a constant (which works out to ρ/R and is at most 1), so the maximum number of corrected mistakes is 1/constant² = R²/ρ².
• Check this T with the model you simulate 65
PLA - Summary
• Guaranteed classification if D is linearly separable
• Weights updated based on misclassification
• Inner product of wf and wt grows fast
• Length of wt grows slowly
• Stopping criterion → alignment with wf.

• Simple to implement, for any dimension d of D.

• ρ = min over n of yn · (wfᵀ / ‖wf‖) · xn depends on wf, hence the number of steps is not known in advance.

66
Noisy data

+ noise

how to at least get g ≈ f on noisy D?


67
Linear separability
• PLA completes → no misclassified points remain

• D allows some set of weights (w's) to classify correctly

• All such D's can be termed 'linearly separable'

68
https://www.csie.ntu.edu.tw/~htlin/mooc/
Line with noise tolerance
• 'Little' noise → a small fraction of points cannot be linearly separated

• Finding the weights with the fewest misclassifications is NP-hard

• Can PLA be modified to get an approximate g? (→ the Pocket Algorithm)

69
Summary
• PLA → hyperplanes as linear classifiers in ℝᵈ

• Corrections based on misclassified points

• Perfect / guaranteed separation if the data are linearly separable

70
