Unit 2: Statistical Decision Making


Unit-2

Statistical Decision Making


Syllabus
– Introduction, Bayes Theorem, Multiple Features, Conditionally Independent
Features, Decision Boundaries.
– Decision Tree: Information gain, Entropy, Gini-index, building Decision Tree with
example problems.
– Nonparametric Decision Making: Introduction, Histograms, kernel and Window
Estimators, K-Nearest Neighbour Classification Techniques, Adaptive Decision
Boundaries.
• Statistical Estimation:
– The population is the entire set of data; a smaller set of data drawn from the
population is called a sample.
– Since it is usually impractical to perform estimation on the entire population,
estimation is performed on a sample.

• Classification of Statistical Estimation:
– Parametric: uses parameters such as the mean and standard deviation (SD).
– Non-Parametric: does not use the mean and standard deviation.
Statistical / Parametric Decision Making
This refers to the situation in which we assume the general form of the probability
distribution function or density function for each class.
• Statistical/Parametric methods use a fixed number of parameters to build the
model.
• Parametric methods commonly assume a normal (Gaussian) distribution.
• The parameters of the normal distribution are:
– Mean
– Standard Deviation
• For each feature, we first estimate the mean and standard deviation of the feature
for each class.

Positive and negative covariance
• Positive covariance: as temperature rises, ice cream sales also rise. The two
variables move together, so the covariance is positive.
• Negative covariance: as temperature rises, cold-related disease decreases, so
the covariance is negative.
• No covariance: temperature and stock market movements have no link.
Example: two sets of data X and Y
• Compute the means of X and Y.
• Compute the deviations (x − x̄) and (y − ȳ).
• Apply the covariance formula.
• In the worked example the final result is 35/5 = 7, which is a positive covariance.
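A minimal sketch of the covariance computation in Python. The temperature/ice-cream
numbers below are hypothetical (the slide's original X and Y values are not reproduced
here), so the result will differ from the 7 obtained above.

```python
import numpy as np

# Hypothetical data (not the slide's values): temperature vs. ice creams sold
x = np.array([20, 22, 25, 28, 30], dtype=float)   # temperature (deg C)
y = np.array([40, 44, 52, 60, 64], dtype=float)   # ice creams sold

# Covariance "by hand": mean of the products of deviations
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# Same value via NumPy (bias=True divides by n, matching the slide's 35/5 style)
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_manual, cov_numpy)   # both positive -> positive covariance
```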


Statistical / Parametric Decision Making - continued
• Parametric methods can perform well in many situations, and their performance is
at its peak when the spread of each group is different.
• The goal of most classification procedures is to estimate the probabilities that a
pattern to be classified belongs to the various possible classes, based on the values
of some feature or set of features.
Ex1: Classify the fish on a conveyor belt as salmon or sea bass.
Ex2: Estimate the probabilities that a patient has various diseases given
some symptoms or lab tests (using laboratory parameters).
Ex3: Identify a person as Indian/Japanese based on statistical parameters
like height, face and nose structure.
• In most cases, we decide in favour of the most likely class.
• We need a mathematical decision-making algorithm to obtain the classification or
decision.
Bayes Theorem
When the joint probability P(A∩B) is hard to calculate, or when the inverse (Bayes)
probability P(B|A) is easier to calculate, Bayes' theorem can be applied.

Revisiting conditional probability


Suppose that we are interested in computing the probability of event A and we
have been told that event B has occurred.
Then the conditional probability of A given B is defined to be:

P[A|B] = P[A ∩ B] / P[B],  and similarly  P[B|A] = P[A ∩ B] / P[A]

• The original sample space is the red coloured rectangular box (in the figure).
• We want the probability of A occurring given the sample space restricted to B;
hence P(B) appears in the denominator.
• The area in question is the intersection of A and B.

From the above expressions, we can rewrite

P[A ∩ B] = P[B].P[A|B]
and P[A ∩ B] = P[A].P[B|A]

(either form can be used to calculate P[A ∩ B]).

So
P[A ∩ B] = P[B].P[A|B] = P[A].P[B|A]
or
P[B].P[A|B] = P[A].P[B|A]

P[A|B] = P[A].P[B|A] / P[B] - Bayes Rule


Bayes Theorem
The goal is to measure P(wi|X), the measurement-conditioned or posterior
probability, from three values: the class-conditional likelihood P(X|wi), the
prior P(wi), and the evidence P(X):

P(wi|X) = P(X|wi) P(wi) / P(X)

This is the probability of a feature vector X being assigned to class wi.
Example for Bayes Rule / Theorem
Example 1:
In a deck of 52 cards (excluding jokers), compute the probability that a card is a
King given that it is a Face card.

• P(King|Face) = P(Face|King) * P(King) / P(Face)
= 1 * (4/52) / (12/52)
= 1/3
Example 2:
Classes are cold (C) and not-cold (C'). The feature is fever (f).
Prior probability of a person having a cold: P(C) = 0.01.
Probability of having a fever, given that a person has a cold: P(f|C) = 0.4.
Overall probability of fever: P(f) = 0.02.

Then, using Bayes' theorem, the probability that a person has a cold, given that she
(or he) has a fever, is:

P(C|f) = P(f|C) P(C) / P(f) = (0.4 × 0.01) / 0.02 = 0.2
Generalized Bayes Theorem
• Consider three classes A1, A2 and A3.
• The area under the red box is the sample space.
• Consider them to be mutually exclusive and collectively exhaustive.
• Mutually exclusive means that if one event occurs, the others cannot occur.
• Collectively exhaustive means that combining all the events, i.e. P(A1), P(A2)
and P(A3), gives the sample space, i.e. the total rectangular red coloured space.
• Now consider another event B that occurs over A1, A2 and A3, so that some area of
B is common with A1, some with A2 and some with A3 (as shown in the figure).

P(B) = ?
• The portion common to A1 and B is P(A1 ∩ B).
• The portion common to A2 and B is P(A2 ∩ B).
• The portion common to A3 and B is P(A3 ∩ B).

• The total probability of B is therefore

P(B) = P(A1 ∩ B) + P(A2 ∩ B) + P(A3 ∩ B)

• Remember: P(Ai ∩ B) = P(Ai) P(B|Ai)

• Replacing this in the equation above gives the simplified form of P(B):

P(B) = P(A1) P(B|A1) + P(A2) P(B|A2) + P(A3) P(B|A3)

Arriving at the generalized version of Bayes theorem:

P(Ai|B) = P(Ai) P(B|Ai) / [ P(A1) P(B|A1) + P(A2) P(B|A2) + P(A3) P(B|A3) ]
Example 3: Problem on Bayes theorem with a 3-class case
What is being asked
• While solving a problem based on Bayes theorem, we need to split
the given information carefully:
• Identify what is asked: a posterior probability of the form P(Ai|B).
• Note that the "flip" of what is asked, P(B|Ai), will always be given in the
problem statement, along with the prior probabilities P(Ai).
• The given problem can then be represented in the form of the generalized
Bayes theorem above and solved by substitution.
Example 4:
Given that 1% of people have a certain genetic defect (so 99% do not have it).
In 90% of tests on people with the genetic defect, the defect/disease is found positive (true positives).
9.6% of the tests on non-diseased people are false positives.

If a person gets a positive test result, what is the probability that they actually
have the genetic defect?

A = having the genetic defect. Given in the question as 1%, so P(A) = 0.01.
That also means the probability of not having the gene is P(~A) = 0.99.
X = a positive test result.

P(A|X) = probability of having the genetic defect given a positive test result (to be computed).
P(X|A) = chance of a positive test result given that the person actually has the genetic defect = 90% (0.90).
P(X|~A) = chance of a positive test if the person doesn't have the genetic defect, given in the question as 9.6% (0.096).

Now we have all of the information we need to put into the equation:

P(A|X) = (0.9 * 0.01) / (0.9 * 0.01 + 0.096 * 0.99) = 0.0865 (8.65%).

The probability of having the faulty gene, given a positive test, is 8.65%.


Example 5
Given the following statistics, what is the probability that a woman has
cancer if she has a positive mammogram result?
One percent of women over 50 have breast cancer.
Ninety percent of women who have breast cancer test positive on
mammograms.
Eight percent of women without breast cancer will have false positives.

Let W denote women having cancer and ~W women not having cancer.
A positive test result is PT.
Solution for Example 5
What is asked: the probability that a woman has cancer if she
has a positive mammogram result, P(W|PT).

• P(W) = 0.01
• P(~W) = 0.99
• P(PT|W) = 0.9
• P(PT|~W) = 0.08

P(W|PT) = (0.9 * 0.01) / ((0.9 * 0.01) + (0.08 * 0.99)) = 0.009 / 0.0882 ≈ 0.10.
Example 6
A disease occurs in 0.5% of the population
(0.5% = 0.5/100 = 0.005).

A diagnostic test gives a positive result in:
◦ 99% of people with the disease
◦ 5% of people without the disease (false positives)

A person receives a positive result.
What is the probability of them having the disease, given the positive result?
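A minimal Python sketch of the two-hypothesis form of Bayes' rule used in Examples 4-6.
The function and argument names are illustrative, not from the slides; the numbers are
the ones given in the three examples.

```python
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """P(condition | positive test) for a two-hypothesis Bayes' rule."""
    evidence = prior * p_pos_given_true + (1 - prior) * p_pos_given_false
    return prior * p_pos_given_true / evidence

# Example 4: genetic defect
print(posterior(0.01, 0.90, 0.096))   # ~0.0865 (8.65%)

# Example 5: breast cancer given a positive mammogram
print(posterior(0.01, 0.90, 0.08))    # ~0.102 (about 10%)

# Example 6: disease given a positive diagnostic test
print(posterior(0.005, 0.99, 0.05))   # ~0.0905 (about 9%)
```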

Independence
• Independent random variables: Two random variables X and Y are said to be statistically
independent if and only if :

• p(x,y) = p(x).p(y)

• Example: the outcomes of tossing two coins are independent, so their joint
probability is the product of their individual probabilities.
• Another example: X = throw of a die, Y = toss of a coin; X and Y are independent.
• In contrast, X = height and Y = weight have a joint probability but are not
independent; usually they are dependent.
• Independence is equivalent to saying
• P(y|x) = P(y) or
• P(x|y) = P(x)
Conditional Independence
• Two random variables X and Y are said to be independent given Z if and only
if

• P(x,y|z)=P(x|z).P(y|z) : indicates that X and Y are independent given Z.

• Example: X = throw of a die,
Y = toss of a coin,
Z = card drawn from a deck.

Here X and Y are conditionally independent given Z (knowing the card tells us
nothing about the die or the coin).


Joint Probability
• A joint probability is the probability of event Y occurring at the same
time as event X. For the joint probability to factor into a simple product,
the two events must be independent of one another, i.e. not conditional on
each other.
• One requirement is that events X and Y happen at the same
time. Example: throwing two dice simultaneously.
• The other is that events X and Y are independent of each other,
meaning the outcome of event X does not influence the outcome of
event Y. Example: rolling two dice.
Joint probabilities are dependent but conditionally
independent
• Let us consider:
– X: Height
– Y: Vocabulary
– Z: Age

– A smaller height indicates a younger age, and hence the vocabulary tends to differ.
– So vocabulary is dependent on height.
– Now let us add a condition Z.
– If age is fixed, say 30, consider only samples of people aged 30: as the height
increases, the vocabulary does not change.
– So X and Y are conditionally independent given Z, but without the condition their
joint probabilities are dependent.
Reverse:
• Two events can be independent, but conditionally become dependent.
• Let us say X = dice throw 1
• Y = dice throw 2
• By themselves they are independent.
• Now let us add Z = sum of the two dice.
• Given Z, once the X value is fixed, the Y value is determined by X (Y = Z − X),
so Y depends on X.
• So X and Y are no longer independent given Z; the conditional independence
notation X ⊥ Y | Z does not hold in this case.
• Naïve Bayes algorithm is a supervised learning algorithm, which
is based on Bayes theorem and used for solving classification
problems.
• It is mainly used in text classification that includes a high-
dimensional training dataset.
• It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object.
• Some popular applications of the Naïve Bayes algorithm are spam
filtering, sentiment analysis, and classifying articles.
Example of Naïve Bayes Classifier (mammals vs. non-mammals)
Solution: comparing the two class scores for the given attribute values A,

P(A|M)P(M) > P(A|N)P(N)

=> the instance is classified as a Mammal.
Example: 'Play Tennis' data
Based on the examples in the table (the standard 14-day Play Tennis dataset, listed
in full in the decision tree section below), classify the following datum x:
x = (Outlook=Sunny, Temp=Cool, Humidity=High, Wind=Strong)
• That means: play tennis or not?
• Working: see the sketch below.
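A minimal sketch of the Naïve Bayes working for this query, using the class counts from
the standard 14-day Play Tennis table (shown later in the decision tree section). The
variable names are illustrative.

```python
# Counts taken from the 14-row Play Tennis table: 9 "Yes" days, 5 "No" days.
p_yes, p_no = 9/14, 5/14

# Conditional probabilities for x = (Sunny, Cool, High, Strong), counted per class
likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9)   # Sunny|Yes, Cool|Yes, High|Yes, Strong|Yes
likelihood_no  = (3/5) * (1/5) * (4/5) * (3/5)   # Sunny|No,  Cool|No,  High|No,  Strong|No

score_yes = p_yes * likelihood_yes   # ~0.0053
score_no  = p_no  * likelihood_no    # ~0.0206

print("Yes" if score_yes > score_no else "No")   # -> "No": do not play tennis
```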
Air-Traffic Data
Days Season Fog Rain Class
Weekday Spring None None On Time
Weekday Winter None Slight On Time
Weekday Winter None None On Time
Holiday Winter High Slight Late
Saturday Summer Normal None On Time
Weekday Autumn Normal None Very Late
Holiday Summer High Slight On Time
Sunday Summer Normal None On Time
Weekday Winter High Heavy Very Late
Weekday Summer None Slight On Time

Cond. to next slide…


Air-Traffic Data
Cond. from previous slide…
Days Season Fog Rain Class
Saturday Spring High Heavy Cancelled
Weekday Summer High Slight On Time
Weekday Winter Normal None Late
Weekday Summer High None On Time
Weekday Winter Normal Heavy Very Late
Saturday Autumn High Slight On Time
Weekday Autumn None Heavy On Time
Holiday Spring Normal Slight On Time
Weekday Spring Normal None On Time
Weekday Spring Normal Heavy On Time
Naïve Bayesian Classifier
• Example: With reference to the Air Traffic dataset mentioned earlier, let us
tabulate all the class-conditional and prior probabilities as shown below.
                     Class
Attribute            On Time        Late         Very Late    Cancelled
Day: Weekday         9/14 = 0.64    1/2 = 0.5    3/3 = 1      0/1 = 0
Day: Saturday        2/14 = 0.14    1/2 = 0.5    0/3 = 0      1/1 = 1
Day: Sunday          1/14 = 0.07    0/2 = 0      0/3 = 0      0/1 = 0
Day: Holiday         2/14 = 0.14    0/2 = 0      0/3 = 0      0/1 = 0
Season: Spring       4/14 = 0.29    0/2 = 0      0/3 = 0      0/1 = 0
Season: Summer       6/14 = 0.43    0/2 = 0      0/3 = 0      0/1 = 0
Season: Autumn       2/14 = 0.14    0/2 = 0      1/3 = 0.33   0/1 = 0
Season: Winter       2/14 = 0.14    2/2 = 1      2/3 = 0.67   0/1 = 0
Naïve Bayesian Classifier

                     Class
Attribute            On Time        Late         Very Late    Cancelled
Fog: None            5/14 = 0.36    0/2 = 0      0/3 = 0      0/1 = 0
Fog: High            4/14 = 0.29    1/2 = 0.5    1/3 = 0.33   1/1 = 1
Fog: Normal          5/14 = 0.36    1/2 = 0.5    2/3 = 0.67   0/1 = 0
Rain: None           5/14 = 0.36    1/2 = 0.5    1/3 = 0.33   0/1 = 0
Rain: Slight         8/14 = 0.57    0/2 = 0      0/3 = 0      0/1 = 0
Rain: Heavy          1/14 = 0.07    1/2 = 0.5    2/3 = 0.67   1/1 = 1
Prior Probability    14/20 = 0.70   2/20 = 0.10  3/20 = 0.15  1/20 = 0.05
Naïve Bayesian Classifier
Instance: Weekday, Winter, High (fog), Heavy (rain), Class = ???

Case 1: Class = On Time  : 0.70 × 0.64 × 0.14 × 0.29 × 0.07 = 0.0013
Case 2: Class = Late     : 0.10 × 0.50 × 1.0 × 0.50 × 0.50 = 0.0125
Case 3: Class = Very Late: 0.15 × 1.0 × 0.67 × 0.33 × 0.67 = 0.0222
Case 4: Class = Cancelled: 0.05 × 0.0 × 0.0 × 1.0 × 1.0 = 0.0000

Case 3 has the highest score; hence the correct classification is Very Late.


• The Naïve Bayes classifier is very popular for document classification.
• ("Naïve" means all attributes are treated as equally important and
conditionally independent of each other.)
Application of Naïve Bayes Classifier for NLP
• Consider the following sentences:
– S1 : The food is Delicious : Liked
– S2 : The food is Bad : Not Liked
– S3 : Bad food : Not Liked

– Given a new sentence, classify it as a Liked or Not Liked sentence.

– Given sentence: "Delicious Food"


• Remove stop words, then perform stemming
        F1     F2          F3     Output
        Food   Delicious   Bad
S1      1      1           0      1
S2      1      0           1      0
S3      1      0           1      0
• P(Liked | attributes) ∝ P(Delicious | Liked) * P(Food | Liked) * P(Liked)
• = (1/1) * (1/1) * (1/3) = 0.33
• P(Not Liked | attributes) ∝ P(Delicious | Not Liked) * P(Food | Not Liked) * P(Not Liked)
• = (0) * (2/2) * (2/3) = 0
• Hence the given sentence "Delicious Food" belongs to the Liked class.
ID3 Decision Tree
• Decision tree algorithms transform raw data into rule-based
decision trees.
• ID3 is one of the most common decision tree algorithms.
• It was introduced in 1986, and its name is an acronym of Iterative
Dichotomiser.
• Dichotomisation means dividing into two completely opposite
things.
• The algorithm iteratively divides the attributes into two groups, the most
dominant attribute and the others, to construct a tree.
• It calculates the entropy and information gain of each attribute.
• In this way, the most dominant attribute is found, and it is placed on the
tree as a decision node.
• Thereafter, entropy and gain scores are calculated again among the
remaining attributes.
• Thus, the next most dominant attribute is found.
• This procedure continues until a decision is reached for each branch.
• That is why it is called an Iterative Dichotomiser.
• We will go through the algorithm step by step.
• Definition: Entropy is the measure of impurity, disorder or
uncertainty in a set of examples.
• What does Entropy basically do?
• Entropy controls how a Decision Tree decides to split the data. It
actually affects how a Decision Tree draws its boundaries.
• The equation of entropy:

Entropy(S) = − Σi pi log2(pi)

where pi is the proportion of examples in S belonging to class i.
• What is Information Gain, and why does it matter in a Decision Tree?
• Definition: Information gain (IG) measures how much "information" a feature
gives us about the class.
• Why does it matter?
• Information gain is the main criterion used by decision tree algorithms to
construct a decision tree.
• The decision tree algorithm always tries to maximize information gain.
• Information gain quantifies which feature provides the most
information about the classification, based on the notion of entropy.
• The attribute with the highest information gain is tested/split first.
• The equation of information gain:

Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) Entropy(Sv)

where Sv is the subset of S for which attribute A has value v.
For instance, the following table records the factors affecting the decision to play
tennis outside over the previous 14 days.

Day Outlook Temp. Humidity Wind Decision

1 Sunny Hot High Weak NO

2 Sunny Hot High Strong NO

3 Overcast Hot High Weak YES

4 Rain Mild High Weak YES

5 Rain Cool Normal Weak YES

6 Rain Cool Normal Strong NO

7 Overcast Cool Normal Strong YES

8 Sunny Mild High Weak NO

9 Sunny Cool Normal Weak YES

10 Rain Mild Normal Weak YES

11 Sunny Mild Normal Strong YES

12 Overcast Mild High Strong YES

13 Overcast Hot Normal Weak YES

14 Rain Mild High Strong NO
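A minimal sketch computing the dataset entropy and the information gain of the Outlook
attribute from the class counts in the table above (9 YES, 5 NO overall; Sunny = 2 YES / 3 NO,
Overcast = 4 YES / 0 NO, Rain = 3 YES / 2 NO).

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Whole dataset: 9 YES, 5 NO
e_s = entropy([9, 5])                      # ~0.940

# Outlook splits the 14 days into Sunny(2,3), Overcast(4,0), Rain(3,2)
splits = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}
weighted = sum(sum(c) / 14 * entropy(c) for c in splits.values())

gain_outlook = e_s - weighted              # ~0.247, the largest gain among the
print(e_s, gain_outlook)                   # four attributes, so Outlook is the root
```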


• So, decision tree algorithms transform raw data into a rule-based
mechanism.
• In this section, we have discussed one of the most common decision tree
algorithms, ID3.
• Decision trees can use nominal attributes directly, whereas many common machine
learning algorithms cannot.
• However, ID3 requires numeric attributes to be transformed into nominal ones.
• Evolved versions of ID3 exist that can also handle numeric attributes directly.
• Even though decision tree algorithms are powerful, they have long training
times.
• They also tend to over-fit.
• Evolved versions named random forests are less prone to over-fitting and have
shorter training times.
Gini Index in Action
• The Gini Index, or Gini impurity, measures the probability of
a particular element being wrongly classified when it is
chosen at random.
• The Gini index varies between 0 and 1, where 0 expresses
perfect purity of classification, i.e. all the elements belong to one
specified class or only one class exists there.
• A value of 1 indicates a random distribution of elements across
various classes.
• A Gini Index of 0.5 indicates an equal distribution of elements
over the classes.
• While designing the decision tree, the features with
the lowest Gini Index are preferred for splitting.
• The Classification and Regression Tree (CART) algorithm
uses the Gini Index to generate binary splits.
• More generally, decision tree algorithms split a node using Information
Gain, and the Gini Index or Entropy is the impurity measure used to
compute that Information Gain.
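A minimal sketch of the Gini impurity computation. The class counts below (9 vs. 5) are
borrowed from the Play Tennis table as an illustration; the pure and 50/50 cases
illustrate the 0 and 0.5 values mentioned above.

```python
def gini(counts):
    """Gini impurity of a class distribution given as a list of counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([14, 0]))   # 0.0   -> pure node, only one class present
print(gini([7, 7]))    # 0.5   -> equal distribution over two classes
print(gini([9, 5]))    # ~0.459 -> Play Tennis class distribution (9 YES, 5 NO)
```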
In parametric decision making, only the parameters of the densities, such as their MEAN or
VARIANCE, have to be estimated from the data before using them to estimate probabilities of class
membership.
These methods typically assume that the data follow a known probability distribution, such as the
normal distribution, and estimate the parameters of this distribution using the available data.
The basic idea behind parametric methods is that there is a set of fixed
parameters that determine a probability model, and this idea is used in Machine
Learning as well.
• The parameters of the normal distribution are:
• Mean
• Standard Deviation
Non-parametric methods are statistical techniques that do not rely on specific assumptions about
the underlying distribution of the population being studied.
NON-PARAMETRIC DECISION MAKING
In the nonparametric approach, the distribution of the data is not defined by a finite set of
parameters.
A nonparametric model does not take a predetermined form; instead, the model is
constructed from information derived from the data.
It does not use the MEAN or VARIANCE as model parameters.
Non-parametric decision making is considered more robust.
Some popular non-parametric decision making techniques include:
Histograms, scatterplots or tables of data
Kernel Density Estimation
K-Nearest Neighbours (KNN)
Support Vector Machines (SVM)

HISTOGRAM Continued
• The counts, or frequencies of observations, in each bin are then
plotted as a bar graph with the bins on the x-axis and the
frequency on the y-axis.
• A common rule of thumb is to choose the number of intervals (bins)
equal to the square root of the number of samples.
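A minimal sketch of the square-root rule using NumPy; the data values below are
hypothetical.

```python
import numpy as np

# Hypothetical sample of 36 values, so the rule of thumb gives sqrt(36) = 6 bins
data = np.random.default_rng(0).normal(loc=50, scale=10, size=36)

n_bins = int(np.sqrt(len(data)))            # square-root rule of thumb
counts, edges = np.histogram(data, bins=n_bins)

for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"[{lo:5.1f}, {hi:5.1f}) : {c}")  # frequency in each bin
```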
Histogram Examples (figures)
Non-Parametric Methods: Histogram Estimator
• A histogram is a convenient way to describe the data.
• To form a histogram, the data from a single class are grouped into
intervals.
• Over each interval a rectangle is drawn, with height proportional to the
number of data points falling in that interval. In the example, each
interval is chosen to have a width of two units.
HISTOGRAM Continued
For example, the height in the first row (of the figure) is 0.1/4 = 0.025.
Kernel and Window Estimators
• Kernel density estimation is the process of estimating an unknown
probability density function using a kernel function K(u)
• While a histogram counts the number of data points in somewhat arbitrary
regions, a kernel density estimate is a function defined as the sum of a
kernel function on every data point.
• Samples themselves can be thought of as very rough approximations to the true
density function, namely a set of spikes or delta functions, one for each sample
value with very small width and very large height such that the combined
area of all spikes is equal to one.
• Each Delta function is replaced by Kernel Functions
such as rectangles, triangles or normal density
functions which have been scaled so that their
combined area should be equal to one.
Kernel Density function
Example with six samples grouped into intervals of width 2: the counts are
-4 to -2 = 1, -2 to 0 = 2, 0 to 2 = 1, 2 to 4 = 0, 4 to 6 = 1 and 6 to 8 = 1.
Height of an interval containing one sample = 1/(6 × 2) ≈ 0.08, and so on.
● The first property of a kernel function is that it must be symmetric. This
means the value of the kernel function is the same for both +u and −u, as shown in
the plot below. This can be mathematically expressed as K(−u) = K(+u). The
symmetry of the kernel function means that the maximum value of the
function, max(K(u)), lies in the middle of the curve.

● The area under the curve of the function must be equal to one. Mathematically,
this property is expressed as ∫ K(u) du = 1 (over −∞ < u < ∞).

● The value of the kernel function, which is a density, cannot be negative: K(u) ≥
0 for all −∞ < u < ∞.
Kernel Density Estimate
Method
Choose an appropriate kernel.
Centre the kernel on each of the data points.
The density consists of a (normalized) sum of all overlapping kernel values at the
given point
Sample   1     2     3     4    5    6
Value   -2.1  -1.3  -0.4   1.9  5.1  6.2

In summary, KDE:
• Uses regions centered at the data points
• Allows the regions to overlap
• Lets each individual sample contribute a total density of 1/N
• With a Gaussian kernel, regions have soft edges to avoid discontinuities
KERNEL DENSITY ESTIMATION
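A minimal sketch of a Gaussian kernel density estimate for the six sample values above;
the bandwidth h = 1.0 is an arbitrary illustrative choice.

```python
import numpy as np

samples = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])

def kde(x, data, h=1.0):
    """Gaussian KDE: average of kernels centred on each data point."""
    u = (x - data) / h
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel K(u)
    return kernels.sum() / (len(data) * h)               # each sample contributes 1/N

# Evaluate the estimated density on a grid of points
for x in np.linspace(-5, 8, 5):
    print(f"x = {x:5.2f}  density = {kde(x, samples):.4f}")
```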
Similarity and Dissimilarity
• Distances are used to measure similarity.
• There are many ways to measure the distance between two instances.
Distance or similarity measures are essential in solving many pattern recognition problems such
as classification and clustering.
Various distance/similarity measures are available in the literature to compare two data
distributions.
As the names suggest, a similarity measures how close two distributions are.

For algorithms like the k-nearest neighbor and k-means, it is essential to measure the distance
between the data points.
• In KNN we calculate the distance between points to find the nearest neighbor.
• In K-Means we find the distance between points to group data points into clusters based on
similarity.
• It is vital to choose the right distance measure as it impacts the results of our algorithm.
Euclidean Distance
• We are most likely to use Euclidean distance when calculating the distance between two rows
of data that have numerical values, such as floating point or integer values.
• If columns have values with differing scales, it is common to normalize or standardize the
numerical values across all columns prior to calculating the Euclidean distance. Otherwise,
columns that have large values will dominate the distance measure.

dist(p, q) = sqrt( Σk (pk − qk)² )

• where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth
attributes (components) of data objects p and q.

Compute the Euclidean Distance between the following data set


• D1= [10, 20, 15, 10, 5]
• D2= [12, 24, 18, 8, 7]
Manhattan distance:
Manhattan distance is a metric in which the distance between two points is the
sum of the absolute differences of their Cartesian coordinates. Put simply, it is
the total of the absolute differences between the x-coordinates and the
y-coordinates.
Formula: in a plane with p1 at (x1, y1) and p2 at (x2, y2),
distance = |x1 − x2| + |y1 − y2|

• In general, Manhattan distance = Σ for i = 1 to N of |v1[i] − v2[i]|


Compute the Manhattan distance for the following
• D1 = [10, 20, 15, 10, 5]
• D2 = [12, 24, 18, 8, 7]
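A minimal sketch computing both exercises above (the Euclidean and Manhattan distances
between D1 and D2) with NumPy.

```python
import numpy as np

d1 = np.array([10, 20, 15, 10, 5], dtype=float)
d2 = np.array([12, 24, 18,  8, 7], dtype=float)

euclidean = np.sqrt(np.sum((d1 - d2) ** 2))   # sqrt(4+16+9+4+4) = sqrt(37) ~ 6.08
manhattan = np.sum(np.abs(d1 - d2))           # 2+4+3+2+2 = 13

print(euclidean, manhattan)
```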
Manhattan distance is also popularly called city block distance.
• Euclidean distance is like the flying distance.
• Manhattan distance is like the distance travelled by car.
Minkowski Distance
• It calculates the distance between two real-valued vectors.
• It is a generalization of the Euclidean and Manhattan distance measures and
adds a parameter, called the "order" or "r", that allows different distance
measures to be calculated.
• The Minkowski distance measure is calculated as follows:

dist(p, q) = ( Σk |pk − qk|^r )^(1/r)

Minkowski distance is a generalization of Manhattan and Euclidean distance:
• Manhattan distance is the L1 norm (r = 1),
• Euclidean distance is the L2 norm (r = 2),
• Minkowski distance is the Lp norm, where p (the order) is commonly 1 or 2.
Cosine Similarity
(widely used in recommendation system and NLP)
– Let A and B be two document vectors.
– Cosine similarity ranges between −1 and +1:
  −1 indicates not at all close and +1 indicates very close in similarity.
– In cosine similarity, data objects are treated as vectors.
– It is measured by the cosine of the angle between two vectors and determines
whether the two vectors are pointing in roughly the same direction. It is often used
to measure document similarity in text analysis.
– Cosine Distance = 1 − Cosine Similarity
cos(A, B) = 1: exactly the same
0: orthogonal
−1: exactly opposite
Formula for Cosine Similarity
• The cosine similarity between two vectors is measured through the angle
'θ' between them: cos(x, y) = (x · y) / (||x|| ||y||).
• If θ = 0°, the 'x' and 'y' vectors overlap, showing that they are similar.
• If θ = 90°, the 'x' and 'y' vectors are dissimilar.
• If two points lie on the same vector (as P1 and P2 do in the example), the
angle between them is 0, so cos(0) = 1, indicating high similarity.
• In this example, two points P1 and P2 are separated by 45 degrees, and hence the
cosine similarity is cos(45°) ≈ 0.71.
• In this example, P1 and P2 are separated by 90 degrees, and hence the cosine
similarity is cos(90°) = 0.
If P1 and P2 are on the opposite side
• If P1 and P2 are on the opposite side then the angle between
them is 180 degree and hence the COS(180)= -1

• If the angle is 270°, the cosine is again 0; at 360° (or 0°) it is 1.


Cosine Similarity
Advantages of Cosine Similarity
• The cosine similarity is beneficial because even if the two
similar data objects are far apart by the Euclidean distance
because of the size, they could still have a smaller angle
between them. Smaller the angle, higher the similarity.
• When plotted on a multi-dimensional space, the cosine
similarity captures the orientation (the angle) of the data
objects and not the magnitude.
Example 1 for computing cosine similarity
Consider an example of finding the similarity between two vectors 'x' and 'y' using
cosine similarity (when the angle cannot be measured directly).

The 'x' vector has values x = { 3, 2, 0, 5 }
The 'y' vector has values y = { 1, 0, 0, 0 }

The formula for the cosine similarity is: Cos(x, y) = x · y / (||x|| * ||y||)

x · y = 3*1 + 2*0 + 0*0 + 5*0 = 3
||x|| = √((3)^2 + (2)^2 + (0)^2 + (5)^2) = 6.16
||y|| = √((1)^2 + (0)^2 + (0)^2 + (0)^2) = 1

∴ Cos(x, y) = 3 / (6.16 * 1) = 0.49


Example 2 for computing cosine distance

d1 = 3 2 0 5 0 0 0 2 0 0 ; d2 = 1 0 0 0 0 0 0 1 0 2

d1 ∙ d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
(the square root of the sum of squares of all the elements)
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449

So cosine similarity = cos(d1, d2) = (d1 ∙ d2) / (||d1|| * ||d2||)
= 5 / (6.481 * 2.449) = 0.3150

Cosine distance (or dissimilarity)
= 1 − cos(d1, d2) = 1 − 0.3150 = 0.6850
Find Cosine distance between
D1 = [5 3 8 1 9 6 0 4 2 1] D2 = [1 0 3 6 4 5 2 0 0 1]
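A minimal sketch that reproduces Example 2 and can be reused for the exercise above;
implemented with NumPy.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example 2 vectors (known answer: similarity ~0.315, distance ~0.685)
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
sim = cosine_similarity(d1, d2)
print(sim, 1 - sim)

# The exercise vectors can be checked the same way:
# cosine_similarity([5,3,8,1,9,6,0,4,2,1], [1,0,3,6,4,5,2,0,0,1])
```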
When to use Cosine Similarity

• Cosine similarity looks at the angle between two vectors, while Euclidean
distance looks at the distance between two points. Hence cosine similarity is
very popular for NLP applications.

• Let's say you are in an e-commerce setting and you want to


compare users for product recommendations:
• User 1 bought 1x eggs, 1x flour and 1x sugar.
• User 2 bought 100x eggs, 100x flour and 100x sugar
• User 3 bought 1x eggs, 1x Vodka and 1x Red Bull

• By cosine similarity, user 1 and user 2 are more similar. By Euclidean
distance, user 3 is more similar to user 1.
JACCARD SIMILARITY AND DISTANCE:
In Jaccard similarity, instead of vectors, we use sets.
It is used to find the similarity between two sets.
Jaccard similarity is defined as the size (count) of the intersection of the sets
divided by the size of their union.

Jaccard similarity between two sets A and B is J(A, B) = |A ∩ B| / |A ∪ B|.

A simple example using set notation: how similar are these two sets?
A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}
J(A,B) = |{0,2,5}| / |{0,1,2,3,4,5,6,7,9}| = 3/9 = 0.33
Jaccard similarity is given by: overlapping items vs. total items.
• The Jaccard similarity value ranges between 0 and 1
• 1 indicates highest similarity
• 0 indicates no similarity
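A minimal sketch of the Jaccard similarity for the two sets above, using Python's
built-in set operations.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b)

A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}
print(jaccard(A, B))   # 3/9 ~ 0.33
```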
Application of Jaccard Similarity
• Language processing is one example where jaccard similarity is
used.

• In this example it is 4/12 = 0.33


Common Properties of a Distance

• Distances, such as the Euclidean distance, have some well known


properties.
1. d(p, q) ≥ 0 for all p and q and d(p, q) = 0 only if
p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.

• A distance that satisfies these properties is a metric, and a space is


called a metric space
Distance Metrics Continued
• Dist(x, y) >= 0
• Dist(x, y) = Dist(y, x) (symmetry)
• Detours cannot shorten distance (triangle inequality):
Dist(x, z) <= Dist(x, y) + Dist(y, z)
Euclidean Distance: distance matrix (example shown as a figure)
Minkowski Distance: distance matrix (example shown as a figure)

Summary of Distance Metrics
• Manhattan Distance: |X1 − X2| + |Y1 − Y2|
Nearest Neighbors Classifiers
• Basic idea:
– If it walks like a duck and quacks like a duck, then it's probably a duck.
• (Figure: compute the distance from the test record to the training records,
then choose the k "nearest" records.)
K-Nearest Neighbors (KNN) : ML algorithm
• Simple, but a very powerful classification algorithm
• Classifies based on a similarity measure
• This algorithm does not build a model
• Does not “learn” until the test example is submitted for classification
• Whenever we have new data to classify, we find its K nearest neighbours in
the training data
• The new point is classified by a "MAJORITY VOTE" over its neighbours' classes
• It is assigned to the most common class amongst its K nearest neighbours
(by measuring the "distance" between data points)
• In practice, k is usually chosen to be odd, so as to avoid ties
• The k = 1 rule is generally called the “nearest-neighbor classification” rule
K-Nearest Neighbors (KNN)
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms based
on Supervised Learning technique.
• The K-NN algorithm assesses the similarity between the new case/data/pattern
and the available cases, and puts the new case into the category that is most
similar to the available categories.

• K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• K-NN algorithm stores all the available data and classifies a new data point
based on the similarity.
• This means that when new data appears, it can easily be classified into a
well-suited category by using the K-NN algorithm.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets
new data, then it classifies that data into a category that is much similar to the
new data.
Illustrative Example for KNN
Data collected over the past few years (training data).
Considering K = 1, the single nearest neighbour determines the test data's
class: it belongs to the Africa class.
Now using K = 3, two of the three neighbours are close to
North/South America, and hence the new data point (the data under test)
belongs to that class.
In this case K = 3 may still not be the right value for classification;
hence a new value of K should be selected.
Algorithm
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance from test sample to all
the data points in the training set.
• Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
• Step-4: Among these K neighbors, apply majority voting.
• Step-5: Assign the new data point to the category for which the
number of neighbors is maximum.
• Step-6: Our model is ready.
Consider the following data set of a pharmaceutical company with assigned class labels.
Using the K nearest neighbour method, classify a new unknown sample with k = 3 and k = 2.

Points X1 (Acid Durability ) X2(strength) Y=Classification

P1 7 7 BAD

P2 7 4 BAD

P3 3 4 GOOD

P4 1 4 GOOD

New pattern with X1 = 3 and X2 = 7: identify the class.


Points X1(Acid Durability) X2(Strength) Y(Classification)
P1 7 7 BAD
P2 7 4 BAD
P3 3 4 GOOD
P4 1 4 GOOD
P5 3 7 ?
KNN

Euclidean distance of P5(3,7) from:
           P1(7,7)   P2(7,4)   P3(3,4)   P4(1,4)
Distance   4.0       5.0       3.0       3.61
Class      BAD       BAD       GOOD      GOOD

With k = 3, the three nearest neighbours of P5 are P3 (GOOD), P4 (GOOD) and
P1 (BAD); the majority vote is GOOD. With k = 2, the two nearest neighbours are
P3 and P4, both GOOD. So the new sample P5(3,7) is classified as GOOD.
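A minimal KNN sketch that reproduces the computation above for the pharmaceutical data;
the function name is illustrative.

```python
import numpy as np
from collections import Counter

# Training data: (Acid Durability, Strength) -> class
X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]], dtype=float)
y = ["BAD", "BAD", "GOOD", "GOOD"]

def knn_classify(query, X, y, k):
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbours
    votes = Counter(y[i] for i in nearest)            # majority vote over their classes
    return votes.most_common(1)[0][0]

p5 = np.array([3, 7], dtype=float)
print(knn_classify(p5, X, y, k=3))   # GOOD
print(knn_classify(p5, X, y, k=2))   # GOOD
```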
New customer named 'Mary' has height 161 cm and weight 61 kg.
Suggest the T-shirt size with K = 3 and K = 5, using Euclidean distance
and also Manhattan distance.

Height (in cms)   Weight (in kgs)   T Shirt Size
158               58                M
158               59                M
158               63                M
160               59                M
160               60                M
163               60                M
163               61                M
160               64                L
163               64                L
165               61                L
165               62                L
165               65                L
168               62                L
168               63                L
168               66                L
170               63                L
170               64                L
170               68                L
A car manufacturer has produced a new SUV. The company wants to show
ads to the users who are interested in buying that SUV.
For this problem, we have a dataset that contains
information about multiple users collected from a social network. The
dataset contains many fields, but we will consider Estimated
Salary and Age as the independent variables and Purchased as the dependent
variable. The dataset is as shown in the table. Using K = 5,
classify the new sample.
• There is no particular way to determine the best value for "K", so we need to try
several values to find the best among them. A commonly preferred value for K is 5.
• A very low value for K, such as K=1 or K=2, can be noisy and makes the model
sensitive to outliers.
• Larger values for K are generally good, but they can cause difficulties
(smoother boundaries, higher bias and more computation).

• Advantages of the KNN Algorithm:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is large.

• Disadvantages of the KNN Algorithm:
• The value of K always needs to be determined, which may sometimes be complex.
• The computation cost is high because of calculating the distance between the
test data point and all the samples in the dataset.
Another example to solve (exercise).
• Because the distance function used to find the k nearest
neighbors is not linear, it usually won't lead to a linear
decision boundary.
Adaptive Decision Boundaries
• Nearest neighbor techniques can approximate arbitrarily complicated
decision regions, but their error rates may be larger than the Bayes
error rate.
• Experimentation may be required to choose K and to edit the reference
samples.
• Classification may be time consuming if the number of reference
samples is large.
• An alternative solution is to assume that the functional form of the
decision boundary between each pair of classes is given, and to find
the decision boundary of that form which best separates the classes in
some sense.
Adaptive Decision Boundaries. Continued
• For example, assume that a linear decision boundary will be used to classify
samples into two classes and each sample has M features.
• Then the discriminant function has the form
D = w0 + w1x1 + … + wMxM
• D = 0 is the equation of the decision boundary between the two classes.
• The weights w0, w1, …, wM are to be chosen to provide good performance on the
training set.
• A sample with feature vector (x1, x2, …, xM) is classified into one class,
say class 1, if D > 0, and into the other class, say class -1, if D < 0.
• w0 is the intercept and w1, w2, …, wM are the weights related to the slopes;
the function has the same form as the line y = mx + c.
Adaptive decision boundaries …continued
• Geometrically, D = 0 is the equation of a hyperplane decision boundary that divides
the M-dimensional feature space into two regions.
• Two classes are said to be linearly separable if there exists a hyperplane decision
boundary such that D>0 for all the samples in class 1 and D<0 for all the samples in the
class -1.
• Figure shows two classes which are separated by a hyperplane.
• Weights w1,w2..wM can be varied. Boundary will be adapted based on the weights.
• During the adaptive or training phase, samples are presented to the current form of the
classifier. Whenever a sample is correctly classified no change is made in the weights.
• When a sample is incorrectly classified, each weight is changed to correct the output.
Adaptive decision boundary algorithm
1. Initialize the weights w0, w1, …, wM to zero, to small random values, or to
some initial guesses.
2. Choose the next sample x = (x1, x2, …, xM) from the training set. Let the 'true'
class or desired value of D be d, so that d = 1 or -1 represents the true class of x.
3. Compute D = w0 + w1x1 + … + wMxM.
4. If the sign of D does not match d (the sample is misclassified), replace each wi
by wi + c·d·xi (a small correction), where c is a small positive constant and x0 = 1.
5. Repeat steps 2 to 4 with each sample in the training set. When finished,
run through the entire training data set again.
6. Stop and report perfect classification when all the samples are classified
properly.
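A minimal sketch of this training loop on a small hypothetical 2-D dataset. The data,
the constant c = 0.1 and the iteration cap are illustrative choices, not from the slides.

```python
import numpy as np

# Hypothetical linearly separable samples: two features per sample, labels +1 / -1
X = np.array([[2.0, 3.0], [3.0, 3.5], [1.0, 1.0], [0.5, 0.2]])
d = np.array([1, 1, -1, -1])

w = np.zeros(3)        # [w0, w1, w2], w0 is the intercept (x0 = 1)
c = 0.1                # small positive constant

for _ in range(100):                       # cap on passes through the training set
    errors = 0
    for x, target in zip(X, d):
        xa = np.concatenate(([1.0], x))    # prepend x0 = 1 for the intercept
        D = np.dot(w, xa)                  # step 3: D = w0 + w1*x1 + w2*x2
        if np.sign(D) != target:           # step 4: misclassified -> adjust weights
            w += c * target * xa
            errors += 1
    if errors == 0:                        # step 6: perfect classification reached
        break

print(w)   # weights of the learned linear decision boundary D = 0
```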
• If there are N classes and M features, the set of linear
discriminant functions is
• D1 = w10 + w11x1 + … + w1MxM
• D2 = w20 + w21x1 + … + w2MxM
• ……
• DN = wN0 + wN1x1 + … + wNMxM
Minimum Squared Error Discriminant Functions
• Although the adaptive decision boundary and adaptive
discriminant function techniques have considerable appeal, they
require many iterations.
• An alternative solution is the "Minimum Squared Error
(MSE)" classification procedure.
• MSE does not require iteration.
• MSE uses a single discriminant function regardless of the number
of classes.
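A minimal sketch of one common way to realize an MSE discriminant: solve for the weights
in closed form with least squares, so no iteration is needed. The data are the same
hypothetical samples used in the previous sketch; np.linalg.lstsq performs the
pseudo-inverse solve.

```python
import numpy as np

# Hypothetical 2-D samples, with labels encoded as the desired outputs +1 / -1
X = np.array([[2.0, 3.0], [3.0, 3.5], [1.0, 1.0], [0.5, 0.2]])
d = np.array([1.0, 1.0, -1.0, -1.0])

A = np.hstack([np.ones((len(X), 1)), X])       # prepend a column of 1s for w0
w, *_ = np.linalg.lstsq(A, d, rcond=None)      # minimize ||A w - d||^2 in one step

D = A @ w                                      # discriminant values for each sample
print(w, np.sign(D))                           # predicted classes via sign(D)
```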
Value of K
• Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points.
– If k is too large, the neighborhood may include points from other
classes.

Rule of thumb: k = sqrt(N), where N is the number of training points.
Choosing the K Value
• There is no structured method to find the best value for "K". We need to find
it by trying various values (trial and error), treating some of the data as
unknown.
• Choosing smaller values for K can be noisy; individual points will have a higher
influence on the result.
• Larger values of K give smoother decision boundaries, which means lower
variance but increased bias; they are also computationally expensive.
• In general practice, the value of k is chosen as k = sqrt(N), where N stands for
the number of samples in your training dataset.
• Try to keep the value of k odd in order to avoid ties between two
classes of data.
Applications of KNN
Banking System: KNN can be used in a banking system to predict whether an
individual is fit for loan approval, i.e. whether that individual has
characteristics similar to those of defaulters.
Credit Ratings: KNN algorithms can be used to find an individual's credit rating by
comparing them with persons having similar traits.
Politics: With the help of KNN algorithms, we can classify a potential voter
into various classes like "Will Vote", "Will Not Vote", "Will Vote for Party
'Congress'", "Will Vote for Party 'BJP'".
KNN can also be used for speech recognition, handwriting detection, image
recognition and video recognition.
