Unit 2: Statistical Decision Making-1
Computing covariance:
• Compute the means x̄ and ȳ of X and Y
• Compute the deviations x - x̄ and y - ȳ for each pair
• Apply the covariance formula: Cov(X, Y) = (1/n) Σ (xi - x̄)(yi - ȳ)
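A minimal sketch of these three steps in Python (the x and y values below are hypothetical, chosen only for illustration; this uses the population form, dividing by n):

```python
# Covariance in three steps: means, deviations, averaged product.
def covariance(x, y):
    n = len(x)
    mean_x = sum(x) / n                          # Step 1: means
    mean_y = sum(y) / n
    products = [(xi - mean_x) * (yi - mean_y)    # Step 2: deviations
                for xi, yi in zip(x, y)]
    return sum(products) / n                     # Step 3: covariance formula

x = [2.1, 2.5, 3.6, 4.0]    # hypothetical data
y = [8.0, 10.0, 12.0, 14.0]
print(covariance(x, y))     # 1.7 -> positive: X and Y increase together
```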
So, by the definition of conditional probability:
P[A ∩ B] = P[B]·P[A|B] = P[A]·P[B|A]
Dividing both sides by P[B] gives Bayes' rule:
P[A|B] = P[A]·P[B|A] / P[B]
In classification terms, for a feature vector X and class wi:
P(wi|X) = P(X|wi)·P(wi) / P(X)
This is the probability of any vector X being assigned to class wi.
Example for Bayes Rule/Theorem
• Given Bayes' Rule: P(A|B) = P(B|A)·P(A) / P(B)
Example 1: Drawing from a standard deck of 52 cards, what is the probability that a card is a King, given that it is a Face card?
• It is given by P(King|Face) = P(Face|King) * P(King) / P(Face)
= 1 * (4/52) / (12/52)
= 1/3
Example 2:
The probability of having a fever, given that a person has a cold, is P(f|C) = 0.4.
The overall probability of fever is P(f) = 0.02.
Then, using Bayes' theorem, the probability that a person has a cold, given that she (or he) has a fever, is:
P(C|f) = P(f|C) · P(C) / P(f) = 0.4 · P(C) / 0.02 = 20 · P(C)
Generalized Bayes Theorem
• Consider we have 3 classes A1, A2 and A3.
• The area inside the red box is the sample space.
• Consider that they are mutually exclusive and collectively exhaustive.
• Mutually exclusive means that if one event occurs, the others cannot occur at the same time.
• Collectively exhaustive means that the events A1, A2 and A3 together cover the entire sample space (the total red rectangular area), so P(A1) + P(A2) + P(A3) = 1.
• Now consider that another event B occurs over A1, A2 and A3.
• Some area of B is common with A1, some with A2 and some with A3.
• It is as shown in the figure below:
P(B) = ?
• The portion common to A1 and B is P(B ∩ A1) = P(A1)·P(B|A1).
• The portion common to A2 and B is P(B ∩ A2) = P(A2)·P(B|A2).
• The portion common to A3 and B is P(B ∩ A3) = P(A3)·P(B|A3).
• So, by the law of total probability: P(B) = P(A1)·P(B|A1) + P(A2)·P(B|A2) + P(A3)·P(B|A3).
So the given problem can be represented by the generalized Bayes theorem:
P(Ai|B) = P(Ai)·P(B|Ai) / Σj P(Aj)·P(B|Aj)
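A small sketch of the total-probability and generalized-Bayes computation for three classes; the prior and likelihood numbers below are hypothetical, used only to show the mechanics:

```python
# Hypothetical priors P(A1), P(A2), P(A3): mutually exclusive, exhaustive
prior = [0.3, 0.5, 0.2]
# Hypothetical likelihoods P(B | Ai)
likelihood = [0.4, 0.1, 0.25]

# Total probability: P(B) = sum_j P(Aj) * P(B | Aj)
p_b = sum(p * l for p, l in zip(prior, likelihood))

# Generalized Bayes: P(Ai | B) = P(Ai) * P(B | Ai) / P(B)
posterior = [p * l / p_b for p, l in zip(prior, likelihood)]

print(p_b)        # 0.22
print(posterior)  # the three posteriors sum to 1
```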
Example-4.
Given: 1% of people have a certain genetic defect (which means 99% don't have the defect).
In 90% of tests on people with the defect, the defect/disease is found positive (true positives).
9.6% of the tests on non-diseased people are false positives.
A = chance of having the genetic defect. That was given in the question as 1%. (P(A) = 0.01)
That also means the probability of not having the gene (~A) is 99%. (P(~A) = 0.99)
X = A positive test result.
P(A|X) = Probability of having the genetic defect given a positive test result. (To be computed)
P(X|A) = Chance of a positive test result given that the person actually has the genetic defect = 90%. (0.90)
P(X|~A) = Chance of a positive test if the person doesn't have the genetic defect. That was given in the question as 9.6% (0.096)
Now we have all of the information we need to put into the equation:
P(A|X) = P(X|A)·P(A) / [P(X|A)·P(A) + P(X|~A)·P(~A)]
= (0.9 * 0.01) / (0.9 * 0.01 + 0.096 * 0.99)
= 0.009 / 0.10404 ≈ 0.0865
So, given a positive test result, the probability of actually having the defect is only about 8.65%.
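The same calculation as a short Python sketch, using only the numbers given in the example:

```python
p_a = 0.01               # P(A): prior probability of the genetic defect
p_not_a = 0.99           # P(~A)
p_x_given_a = 0.90       # P(X|A): true-positive rate
p_x_given_not_a = 0.096  # P(X|~A): false-positive rate

# Total probability of a positive test result
p_x = p_x_given_a * p_a + p_x_given_not_a * p_not_a   # 0.10404

# Bayes' theorem
p_a_given_x = p_x_given_a * p_a / p_x
print(round(p_a_given_x, 4))   # 0.0865 -> about 8.65%
```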
Example-5:
• P(W) = 0.01
• P(~W) = 0.99
• P(PT|W) = 0.9
• P(PT|~W) = 0.08
Compute P(testing positive):
P(PT) = (0.9 * 0.01) + (0.08 * 0.99) = 0.0882
P(W|PT) = (0.9 * 0.01) / 0.0882 ≈ 0.10
Example-6
A disease occurs in 0.5% of the population.
(0.5% means 0.5/100 = 0.005)
What is the probability of a person having the disease, given a positive result?
Independence
• Independent random variables: Two random variables X and Y are said to be statistically
independent if and only if :
• p(x,y) = p(x).p(y)
• X = height and Y = weight are usually not independent; they tend to be
dependent.
• Independence is equivalent to saying
• P(y|x) = P(y) or
• P(x|y) = P(x)
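A quick check of the definition on a toy joint distribution; the table below is constructed, for illustration, so that p(x, y) = p(x)·p(y) holds exactly:

```python
# Toy joint distribution P(X, Y) over binary X and Y (hypothetical values).
joint = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}

# Marginals p(x) and p(y) by summing out the other variable
p_x = {x: sum(v for (xx, _), v in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(v for (_, yy), v in joint.items() if yy == y) for y in (0, 1)}

# Independence test: p(x, y) == p(x) * p(y) for every cell
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9
                  for x in (0, 1) for y in (0, 1))
print(independent)  # True for this table
```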
Conditional Independence
• Two random variables X and Y are said to be independent given Z if and only if
P(X, Y | Z) = P(X|Z) · P(Y|Z)
– For example, a lower height indicates a lower age, and hence the vocabulary might vary.
– So vocabulary is dependent on height; but given age (Z), vocabulary and height become independent.
Air-Traffic Data
Days      Season  Fog     Rain    Class
Weekday   Spring  None    None    On Time
Weekday   Winter  None    Slight  On Time
Weekday   Winter  None    None    On Time
Holiday   Winter  High    Slight  Late
Saturday  Summer  Normal  None    On Time
Weekday   Autumn  Normal  None    Very Late
Holiday   Summer  High    Slight  On Time
Sunday    Summer  Normal  None    On Time
Weekday   Winter  High    Heavy   Very Late
Weekday   Summer  None    Slight  On Time
Class-conditional probabilities and priors:

Attribute            On Time       Late         Very Late    Cancelled
Fog    None          5/14 = 0.36   0/2 = 0      0/3 = 0      0/1 = 0
       High          4/14 = 0.29   1/2 = 0.5    1/3 = 0.33   1/1 = 1
       Normal        5/14 = 0.36   1/2 = 0.5    2/3 = 0.67   0/1 = 0
Rain   None          5/14 = 0.36   1/2 = 0.5    1/3 = 0.33   0/1 = 0
       Slight        8/14 = 0.57   0/2 = 0      0/3 = 0      0/1 = 0
       Heavy         1/14 = 0.07   1/2 = 0.5    2/3 = 0.67   1/1 = 1
Prior Probability    14/20 = 0.70  2/20 = 0.10  3/20 = 0.15  1/20 = 0.05

(The denominators show that the full training set has 20 records, of which only 10 are listed above.)
Naïve Bayesian Classifier
Instance:
Days     Season  Fog   Rain   Class
Weekday  Winter  High  Heavy  ???
• (naïve means: all attributes are treated as equal and independent: every
attribute has equal weightage and is independent of the others)
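A sketch of the naïve Bayes scoring for this instance. Note the hedge: only the Fog and Rain conditional tables appear in this excerpt, so the Day and Season factors are omitted below and the scores are therefore partial:

```python
# Priors and class-conditional probabilities taken from the tables above.
prior      = {"On Time": 0.70, "Late": 0.10, "Very Late": 0.15, "Cancelled": 0.05}
fog_high   = {"On Time": 0.29, "Late": 0.50, "Very Late": 0.33, "Cancelled": 1.00}
rain_heavy = {"On Time": 0.07, "Late": 0.50, "Very Late": 0.67, "Cancelled": 1.00}

# Naive Bayes: score(class) = P(class) * prod_k P(attribute_k | class).
# The Day = Weekday and Season = Winter factors would multiply in as well,
# but those conditional tables are not shown in this excerpt.
scores = {c: prior[c] * fog_high[c] * rain_heavy[c] for c in prior}
for c, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{c:10s} {s:.4f}")
```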
Application of Naïve Bayes Classifier for NLP
• Consider the following sentences:
– S1 : The food is Delicious : Liked
– S2 : The food is Bad : Not Liked
– S3 : Bad food : Not Liked
– Given a new sentence, classify it as Liked or Not Liked.
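A minimal bag-of-words naïve Bayes sketch for these three sentences; the add-one (Laplace) smoothing and the test sentence are choices made here for illustration, not taken from the slides:

```python
from collections import Counter

# Training sentences from the slides (lower-cased)
docs = [("the food is delicious", "Liked"),
        ("the food is bad", "Not Liked"),
        ("bad food", "Not Liked")]

class_counts = Counter(label for _, label in docs)
word_counts = {c: Counter() for c in class_counts}
for text, label in docs:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def score(text, label):
    # P(label) * prod_w P(w | label), with add-one (Laplace) smoothing
    p = class_counts[label] / sum(class_counts.values())
    total = sum(word_counts[label].values())
    for w in text.split():
        p *= (word_counts[label][w] + 1) / (total + len(vocab))
    return p

test = "the food is delicious"
print({c: score(test, c) for c in class_counts})  # 'Liked' scores higher
```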
● The value of the kernel function, which is a density, cannot be negative: K(u) ≥ 0 for all −∞ < u < ∞.
Kernel Density Estimate
Method:
• Choose an appropriate kernel
• Centre the kernel on each of the data points
• The density at a given point is a (normalized) sum of all overlapping kernel values at that point
Sample   1     2     3     4    5    6
Value   -2.1  -1.3  -0.4   1.9  5.1  6.2

In summary, KDE:
• Use regions centered at the data points
• Allow the regions to overlap
• Let each individual point contribute a total density of 1/N
• With a Gaussian kernel, regions have soft edges, which avoids discontinuities
KERNEL DENSITY ESTIMATION
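A sketch of the estimator f̂(x) = (1/(N·h)) Σ K((x − xi)/h) on the six sample values above, using a Gaussian kernel; the bandwidth h = 1.0 is an assumed value, not given in the slides:

```python
import math

# The six sample values from the table above
data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
h = 1.0  # bandwidth (assumed for illustration)

def gaussian_kernel(u):
    """Standard normal kernel: K(u) >= 0 everywhere and integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """f_hat(x) = (1 / (N*h)) * sum_i K((x - x_i) / h).
    Each data point contributes a total density of 1/N."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

for x in [-2.0, 0.0, 2.0, 5.5]:
    print(f"f_hat({x:4.1f}) = {kde(x, data, h):.4f}")
```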
Similarity and Dissimilarity
• Distances are used to measure similarity
• There are many ways to measure the distance between two instances
Distance or similarity measures are essential in solving many pattern recognition problems such
as classification and clustering.
Various distance/similarity measures are available in the literature to compare two data
distributions.
As the name suggests, a similarity measure indicates how close two distributions are.
For algorithms like the k-nearest neighbor and k-means, it is essential to measure the distance
between the data points.
• In KNN we calculate the distance between points to find the nearest neighbor.
• In K-Means we find the distance between points to group data points into clusters based on
similarity.
• It is vital to choose the right distance measure as it impacts the results of our algorithm.
Euclidean Distance
• We are most likely to use Euclidean distance when calculating the distance between two rows
of data that have numerical values, such as floating-point or integer values.
• If columns have values with differing scales, it is common to normalize or standardize the
numerical values across all columns prior to calculating the Euclidean distance. Otherwise,
columns that have large values will dominate the distance measure.
• d(p, q) = sqrt( Σ (pk − qk)² ), with the sum over k = 1 to n,
• where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth
attributes (components) of data objects p and q.
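A sketch combining the two points above: standardize each column, then compute the Euclidean distance. The rows below are hypothetical, with the second column on a deliberately larger scale:

```python
import math

def euclidean(p, q):
    """d(p, q) = sqrt(sum_k (p_k - q_k)^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize(rows):
    """Scale each column to zero mean and unit variance so that
    large-valued columns do not dominate the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical rows: the second column would dominate without scaling.
rows = [[1.0, 1000.0], [2.0, 1500.0], [1.5, 1200.0]]
z = standardize(rows)
print(euclidean(rows[0], rows[1]))  # ~500: dominated by column 2
print(euclidean(z[0], z[1]))        # both columns now contribute fairly
```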
The formula for calculating the cosine similarity is: Cos(x, y) = x · y / (||x|| * ||y||)
d1 = 3 2 0 5 0 0 0 2 0 0 ; d2 = 1 0 0 0 0 0 0 1 0 2
d1 ∙ d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = sqrt(3² + 2² + 5² + 2²) = sqrt(42) ≈ 6.481 ; ||d2|| = sqrt(1² + 1² + 2²) = sqrt(6) ≈ 2.449
Cos(d1, d2) = 5 / (6.481 * 2.449) ≈ 0.3150
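The same calculation in a few lines of Python, confirming the value above:

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))     # 5
norm1 = math.sqrt(sum(a * a for a in d1))    # sqrt(42) ~ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))    # sqrt(6)  ~ 2.449
print(round(dot / (norm1 * norm2), 4))       # 0.315
```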
A simple example using set notation: How similar are these two sets?
A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}
J(A,B) = |A ∩ B| / |A ∪ B| = |{0,2,5}| / |{0,1,2,3,4,5,6,7,9}| = 3/9 = 0.33
Jaccard Similarity is given by the ratio of
overlapping to total items.
• Jaccard Similarity value ranges between 0 to 1
• 1 indicates highest similarity
• 0 indicates no similarity
Application of Jaccard Similarity
• Language processing is one example where Jaccard similarity is
used.
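A short sketch using Python sets, reproducing the example above:

```python
A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

# J(A, B) = |A intersection B| / |A union B|
jaccard = len(A & B) / len(A | B)
print(round(jaccard, 2))   # 3 / 9 = 0.33
```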
Euclidean Distance: worked example showing sample points and the resulting distance matrix (figure).
Minkowski Distance: worked example showing the same points and the distance matrix for different values of r (figure).
The Minkowski distance is given by: d(p, q) = ( Σ |pk − qk|^r )^(1/r), where r is a parameter, n is the number of dimensions, and pk and qk are the kth attributes of p and q.
Summary of Distance Metrics
• Manhattan Distance: |X1-X2| + |Y1-Y2|
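A sketch of the Minkowski family, showing that r = 1 recovers the Manhattan distance and r = 2 the Euclidean distance; the two points are hypothetical:

```python
def minkowski(p, q, r):
    # d(p, q) = ( sum_k |p_k - q_k|^r )^(1/r)
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

p, q = (0, 2), (3, 6)          # hypothetical 2-D points
print(minkowski(p, q, 1))      # Manhattan: |0-3| + |2-6| = 7.0
print(minkowski(p, q, 2))      # Euclidean: sqrt(9 + 16) = 5.0
```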
Nearest Neighbors Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably a duck
(Figure: compute the distance between the test record and the training records, then choose the k nearest.)
• The K-NN algorithm can be used for regression as well as for classification, but
it is mostly used for classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• K-NN algorithm stores all the available data and classifies a new data point
based on the similarity.
• This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead it stores the dataset and, at the time of
classification, performs an action on the dataset.
• The KNN algorithm simply stores the dataset during the training phase, and when it
gets new data, it classifies that data into the category most similar to the
new data.
Illustrative Example for KNN
Collected data over the past few years (training data).
With K = 1, the test point is assigned the class of its single nearest neighbour: it belongs to the Africa class.
With K = 3, two of the three nearest neighbours are close to North/South America, and hence the new data point under testing belongs to that class.
In this case K = 3 is still not a suitable value to classify correctly, hence a new value of K is selected.
Algorithm
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance from test sample to all
the data points in the training set.
• Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
• Step-4: Among these k neighbors, count the number of data points in each category (majority voting).
• Step-5: Assign the new data point to the category with the maximum number of neighbors.
• Step-6: Our model is ready.
Consider the following data set of a pharmaceutical company with assigned class labels.
Using the K-nearest-neighbour method, classify a new unknown sample using k = 3 and k = 2.
Points  X1 (Acid Durability)  X2 (Strength)  Y (Classification)
P1      7                     7              BAD
P2      7                     4              BAD
P3      3                     4              GOOD
P4      1                     4              GOOD
P5      3                     7              ?
Euclidean distance of P5(3, 7) from:
P1: sqrt((7-3)² + (7-7)²) = 4
P2: sqrt((7-3)² + (4-7)²) = 5
P3: sqrt((3-3)² + (4-7)²) = 3
P4: sqrt((1-3)² + (4-7)²) = sqrt(13) ≈ 3.61
With k = 3, the nearest neighbours are P3 (GOOD), P4 (GOOD) and P1 (BAD), so P5 is classified GOOD.
With k = 2, the nearest neighbours are P3 and P4, both GOOD, so P5 is again classified GOOD.
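A short sketch of the whole procedure on this dataset, reproducing the distances above and the GOOD classification for both k = 3 and k = 2:

```python
import math
from collections import Counter

# Training data from the table above: ((X1, X2), class)
train = [((7, 7), "BAD"), ((7, 4), "BAD"), ((3, 4), "GOOD"), ((1, 4), "GOOD")]
test = (3, 7)  # P5

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(test, train, k):
    # Sort training points by distance and keep the k nearest
    neighbours = sorted(train, key=lambda t: euclidean(test, t[0]))[:k]
    # Majority vote among the k nearest neighbours
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

for k in (3, 2):
    print(k, knn(test, train, k))   # GOOD in both cases
    # distances: P3 = 3.0, P4 ~ 3.61, P1 = 4.0, P2 = 5.0
```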
Rule of thumb: K = sqrt(N), where N is the number of training points.
Choosing K Value
• There is no structured method for finding the best value of "K". We have to try
various values by trial and error, treating part of the training data as
unknown in order to evaluate each choice.
• Smaller values of K can be noisy: individual points have a higher influence on
the result.
• Larger values of K give smoother decision boundaries, which means lower
variance but increased bias. They are also computationally expensive.
• In general practice, the value of k is chosen as k = sqrt(N), where N stands for
the number of samples in your training dataset.
• Try to keep the value of k odd in order to avoid ties between two
classes of data.
Advantages of KNN:
● It is simple to implement.
● It is robust to noisy training data.
● It can be more effective if the training data is large.
Disadvantages of KNN:
● The value of K always needs to be determined, which may be complex at times.
● The computation cost is high because the distance between the test data point
and all the samples in the dataset must be calculated.
Applications of KNN
Banking System : KNN can be used in the banking system to predict whether an
individual is fit for loan approval. Does that individual have
characteristics similar to those of defaulters?
Credit Ratings : KNN algorithms can be used to find an individual's credit rating by
comparing them with persons having similar traits.
Politics : With the help of KNN algorithms, we can classify a potential voter
into various classes like "Will Vote", "Will Not Vote", "Will Vote for Party
'Congress'" or "Will Vote for Party 'BJP'".
Other areas in which the KNN algorithm can be used are Speech Recognition, Handwriting Detection, Image
Recognition and Video Recognition.