
Bayesian Decision Theory

Introduction
• Data comes from a process that is not
completely known. This lack of knowledge is
indicated by modeling the process as a random
process.
• Maybe the process is actually deterministic,
but because we do not have access to complete
knowledge about it, we model it as random
and use probability theory to analyze it.
Introduction
• Tossing a coin is a random process because we cannot predict at any toss whether the outcome will be heads or tails; that is why we toss coins, or buy lottery tickets.
• We can only talk about the probability that the outcome of
the next toss will be heads or tails.
• It may be argued that if we have access to extra
knowledge such as the exact composition of the coin, its
initial position, the force and its direction that is applied to
the coin when tossing it, where and how it is caught, and
so forth, the exact outcome of the toss can be predicted.
Introduction
• The extra pieces of knowledge that we do not have
access to are named the unobservable variables.
• In the coin tossing example, the only observable
variable is the outcome of the toss.
• Denoting the unobservables by z and the observable as
x, in reality we have
x = f (z)
• where f (·) is the deterministic function that defines the
outcome from the unobservable pieces of knowledge.
Introduction
• Because we cannot model the process this way,
we define the outcome X as a random variable
drawn from a probability distribution P(X = x)
that specifies the process.
• The outcome of tossing a coin is heads or tails,
and we define a random variable that takes one
of two values. Let us say X = 1 denotes that the
outcome of a toss is heads and X = 0 denotes
tails.
Introduction
• Such X are Bernoulli distributed where the
parameter of the distribution po is the
probability that the outcome is heads:
P(X = 1) = po and
P(X = 0) = 1 − P(X = 1) = 1 − po
• Assume that we are asked to predict the outcome
of the next toss. If we know po, our prediction
will be heads if po > 0.5 and tails otherwise.
Introduction
• If we do not know P(X) and want to estimate
this from a given sample, then we are in the
realm of statistics.
• We have a sample, X, containing examples drawn from the probability distribution of the observables xt, denoted as p(x).
• The aim is to build an approximator to it, p̂(x), using the sample X.
Introduction
• In the coin tossing example, the sample
contains the outcomes of the past N tosses.
Then using X, we can estimate po, which is
the parameter that uniquely specifies the
distribution. Our estimate of po is
p̂o = #{tosses with outcome heads} / #{tosses}
Introduction
• Numerically, using the random variables, xt is 1 if the outcome of toss t is heads and 0 otherwise. Given the sample {heads, heads, heads, tails, heads, tails, tails, heads, heads}, we have X = {1, 1, 1, 0, 1, 0, 0, 1, 1} and the estimate is
p̂o = 6/9 ≈ 0.67
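• A minimal sketch of this estimate in Python (not from the original slides):

```python
# Estimate the Bernoulli parameter p_o from a sample of coin tosses.
# Sample from the slide: 1 = heads, 0 = tails.
sample = [1, 1, 1, 0, 1, 0, 0, 1, 1]

# p_o_hat = #{tosses with outcome heads} / #{tosses}
p_o_hat = sum(sample) / len(sample)
print(p_o_hat)  # 0.666... -> predict heads next, since p_o_hat > 0.5
```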
Classification
• Take the example of credit scoring in a bank: according to their past transactions, some customers are low-risk in that they paid back their loans and the bank profited from them, and other customers are high-risk in that they defaulted.
• Analyzing this data, we would like to learn the class “high-risk customer” so that in the future, when there is a new application for a loan, we can check whether the applicant fits this class description and accept or reject the application accordingly.
Classification
• Using our knowledge of the application, let us
say that we decide that there are two pieces of
information that are observable.
• We observe them because we have reason to
believe that they give us an idea about the
credibility of a customer.
• Let us say, for example, we observe the customer’s yearly income and savings, which we represent by two random variables X1 and X2.
Classification
• It may again be claimed that if we had access to other pieces of knowledge, such as the state of the economy in full detail and full knowledge about the customer, his or her intentions, moral codes, and so forth, whether someone is a low-risk or high-risk customer could have been deterministically calculated.
• But these are unobservables, and with what we can observe, the credibility of a customer is denoted by a Bernoulli random variable C conditioned on the observables X = [X1, X2]T, where C = 1 indicates a high-risk customer and C = 0 indicates a low-risk customer.
Classification
• Thus if we know P(C|X1,X2), when a new application arrives with X1 = x1 and X2 = x2, we can
choose C = 1 if P(C = 1|x1, x2) > 0.5, and C = 0 otherwise
• The problem then is to be able to calculate P(C|x). Using Bayes’ rule, it can be written as
P(C|x) = P(C) p(x|C) / p(x)
Simple Example
• A man is known to speak the truth 3 out of 4 times. He throws a die and says, “it is a six.” What is the probability that it is actually a six?
Simple Example
• The probability that the man speaks the truth is 3/4.

• The probability that the man lies is 1/4.

• The probability of getting a six is 1/6.

• The probability of not getting a six is 5/6.

• Applying Bayes’ theorem, we get the required probability as

P(six | he says six) = (1/6 × 3/4) / (1/6 × 3/4 + 5/6 × 1/4) = 3/8
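• A short sketch (not part of the original slides) verifying this arithmetic with exact fractions:

```python
from fractions import Fraction

p_truth = Fraction(3, 4)    # man speaks the truth
p_lie = Fraction(1, 4)      # man lies
p_six = Fraction(1, 6)      # a six is thrown
p_not_six = Fraction(5, 6)  # a six is not thrown

# He reports a six either truthfully (six and truth)
# or falsely (no six and lie); apply Bayes' theorem.
posterior = (p_six * p_truth) / (p_six * p_truth + p_not_six * p_lie)
print(posterior)  # 3/8
```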
 
Classification
• Combining the prior and what the data tells us using
Bayes’ rule, we calculate the posterior probability of the
concept, P(C|x), after having seen the observation, x.

• We will look into an example classifier based on the Naive Bayes Classifier algorithm.
Car theft Example
• Attributes are Color, Type, and Origin, and the class label, Stolen, can be either Yes or No.
Data set (table of past examples with Color, Type, Origin, and Stolen)
Example
• We want to classify a new instance: a Red, Domestic SUV.
• Note there is no example of a Red Domestic SUV in our data set, so we need to calculate the conditional probabilities from the examples we do have.
Example

Estimated conditional probabilities from the data set:

Type = Sports:
P(Sports | Yes) = 4/6
P(Sports | No) = 2/6

Type = SUV:
P(SUV | Yes) = 1/4
P(SUV | No) = 3/4
Example

For the Yes class:
P(Red|Yes) × P(SUV|Yes) × P(Domestic|Yes) × P(Yes)
= 3/5 × 1/4 × 2/5 × 0.5 = 0.03

For the No class:
P(Red|No) × P(SUV|No) × P(Domestic|No) × P(No)
= 2/5 × 3/4 × 3/5 × 0.5 = 0.09
Example
• Normalizing: 0.03/P(x) + 0.09/P(x) = 1, so P(x) = 0.03 + 0.09 = 0.12
• P(Yes|x) = 0.03/0.12 = 0.25 and P(No|x) = 0.09/0.12 = 0.75

• So, the outcome would be “No”: the car is not classified as stolen.
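• A minimal sketch of the same naive Bayes computation in Python; the probability values are the slide’s estimates, while the helper name classify and the dictionary layout are illustrative:

```python
# Naive Bayes for the car theft example, using the slide's estimates.
cond = {
    "Yes": {"Red": 3/5, "SUV": 1/4, "Domestic": 2/5},
    "No":  {"Red": 2/5, "SUV": 3/4, "Domestic": 3/5},
}
prior = {"Yes": 0.5, "No": 0.5}

def classify(attributes):
    # Unnormalized posterior: P(C) * product of P(x_j | C) over attributes.
    scores = {}
    for c in prior:
        score = prior[c]
        for a in attributes:
            score *= cond[c][a]
        scores[c] = score
    total = sum(scores.values())  # P(x) = 0.03 + 0.09 = 0.12
    return {c: s / total for c, s in scores.items()}

print(classify(["Red", "SUV", "Domestic"]))  # {'Yes': 0.25, 'No': 0.75}
```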


Losses and Risks
• A financial institution, when making a decision on a loan applicant, should take into account the potential gain and loss as well.
• An accepted low-risk applicant increases
profit, while a rejected high-risk applicant
decreases loss.
Losses and Risks
• The loss for a high-risk applicant erroneously
accepted may be different from the potential
gain for an erroneously rejected low-risk
applicant.
• The situation is much more critical and far
from symmetry in other domains like medical
diagnosis or earthquake prediction.
Losses and Risks
• Objective: reduce the probability of error
Actions -> αi
Loss function -> λ(αi | Ck)
• Loss incurred for taking action αi when the true state of nature is Ck
Losses and Risks
• Let us define action αi as the decision to assign the input to class Ci and λik as the loss incurred for taking action αi when the input actually belongs to Ck. Then the expected risk for taking action αi is

R(αi|x) = Σk λik P(Ck|x)

and we choose the action with minimum risk:

choose αi if R(αi|x) = mink R(αk|x)
Losses and Risks
• Let us define K actions αi, i = 1, . . . , K, where αi is the action of assigning x to Ci. In the special case of the 0/1 loss, where

λik = 0 if i = k, and λik = 1 if i ≠ k,

• all correct decisions have no loss and all errors are equally costly. The risk of taking action αi is

R(αi|x) = Σk≠i P(Ck|x) = 1 − P(Ci|x)
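• A small sketch (illustrative, not from the slides) of choosing the minimum-risk action given a loss matrix and class posteriors:

```python
import numpy as np

# Rows: actions alpha_i; columns: true classes C_k.
# 0/1 loss for 3 classes: all errors equally costly, correct decisions free.
loss = np.ones((3, 3)) - np.eye(3)

# Posterior probabilities P(C_k | x) for some input x (made-up numbers).
posterior = np.array([0.6, 0.3, 0.1])

# Expected risk R(alpha_i | x) = sum_k loss[i, k] * P(C_k | x).
risk = loss @ posterior
print(risk)             # [0.4 0.7 0.9] -- equals 1 - P(C_i | x) under 0/1 loss
print(np.argmin(risk))  # 0 -> assign x to the class with the highest posterior
```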
Losses and Risks
• Shorthand: λ(αi | Ck) is written as λik
Discriminant Functions
• Classification can also be seen as implementing a set of discriminant functions, gi(x), i = 1, . . . , K, such that we choose Ci if

gi(x) > gj(x) for all j ≠ i

that is, x ∈ Ci when gi(x) is the maximum of the discriminants.
Discriminant Functions
• We can represent the Bayes’ classifier in this
way by setting
gi(x) = −R(αi|x)
• and the maximum discriminant function
corresponds to minimum conditional risk.
When we use the 0/1 loss function, we have
gi(x) = P(Ci |x)
Discriminant Functions
• When there are two classes, we can define a
single discriminant
g(x) = g1(x) − g2(x)
• We choose
C1 if g(x) > 0
C2 otherwise
Utility Theory
• We have already defined the expected risk and chosen the action that minimizes expected risk.
• We now generalize this to utility theory, which
is concerned with making rational decisions
when we are uncertain about the state.
• Let us say that given evidence x, the
probability of state Sk is calculated as P(Sk|x).
Utility Theory
• We define a utility function, Uik, which measures how good it is to take action αi when the state is Sk. The expected utility is

EU(αi|x) = Σk Uik P(Sk|x)

• A rational decision maker chooses the action that maximizes the expected utility:

choose αi if EU(αi|x) = maxj EU(αj|x)
Utility Theory
• Uik are generally measured in monetary terms,
and this gives us a way to define the loss matrix
λik as well.
• For example, in defining a reject option, if we know how much money we will gain as a result of a correct decision, how much money we will lose on a wrong decision, and how costly it is to defer the decision to a human expert, then, depending on the particular application,
Utility Theory
• we can fill in the correct values Uik in a currency unit, instead of 0, λ, and 1, and make
our decision so as to maximize expected
earnings.
• Note that maximizing expected utility is just
one possibility; one may define other types of
rational behavior, for example, minimizing the
worst possible loss.
Utility Theory
• In the case of reject, we are choosing between the automatic decision made by the computer program and a human decision that is costlier but assumed to have a higher probability of being correct.
Bias and Variance & Bayes’ Estimator
Introduction
• Having discussed how to make optimal
decisions when the uncertainty is modeled using
probabilities, we now see how we can estimate
these probabilities from a given training set.
• A statistic is any value that is calculated from a
given sample.
• In statistical inference, we make a decision
using the information provided by a sample.
Introduction
• Understanding how different sources of error lead to bias and variance helps us improve the data-fitting process, resulting in more accurate models.
• We define bias and variance in three ways:
conceptually, graphically and mathematically.
Conceptual Definition

• Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.
• Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point.
Graphical Definition
(Bulls-eye diagram: estimates scattered around a target; shots systematically off-centre indicate bias, and a wide scatter of shots indicates variance.)
Mathematical Definition
• Let X be a sample from a population specified up to
a parameter θ, and let d = d(X) be an estimator of θ.
• To evaluate the quality of this estimator, we can measure how much it differs from θ, that is, (d(X) − θ)².
• But since d(X) is a random variable (it depends on the sample), we need to average this over possible X and consider r(d, θ), the mean square error of the estimator d, defined as
r(d, θ) = E[(d(X) − θ)²]
Mathematical Definition
• The bias of an estimator is given as
bθ(d) = E[d(X)] − θ
• If bθ (d) = 0 for all θ values, then we say that d
is an unbiased estimator of θ.
Mathematical Definition

• Expanding the mean square error:

r(d, θ) = E[(d − θ)²] = E[(d − E[d] + E[d] − θ)²]
= E[(d − E[d])²] + (E[d] − θ)²

• The two equalities follow because E[d] is a constant and therefore E[d] − θ also is a constant, and because the cross term vanishes: E[d − E[d]] = E[d] − E[d] = 0.
Bias and Variance
• The first term is the variance, which measures how much, on average, the di vary around the expected value (going from one dataset to another), and the second term is the (squared) bias, which measures how much the expected value differs from the correct value θ.
Bias and Variance
• We then write error as the sum of these two
terms, the variance and the square of the bias:
• r(d, θ) = Var(d) + (bθ(d))²
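• To make the decomposition concrete, here is a small simulation (my own sketch, not from the slides): draw many samples, compute an estimator on each, and check that the mean square error equals variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0            # true mean of the population
n, trials = 10, 100_000

# Estimator d(X): the sample mean of n draws (unbiased for theta).
d = rng.normal(theta, 1.0, size=(trials, n)).mean(axis=1)

mse = np.mean((d - theta) ** 2)  # r(d, theta) = E[(d - theta)^2]
bias = d.mean() - theta          # b_theta(d) = E[d] - theta
var = d.var()                    # Var(d)

print(mse, var + bias ** 2)      # the two agree up to sampling noise
```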
An Illustrative Example: Voting Intentions

• Consider a simple model-building task. We wish to create a model for the percentage of people who will vote for a Republican president in the next election.
• As models go, this is conceptually trivial and is
much simpler than what people commonly
envision when they think of "modeling", but it
helps us to cleanly illustrate the difference
between bias and variance.
An Illustrative Example
• A straightforward way to build this model
would be to randomly choose 50 numbers
from the phone book, call each one and ask the
responder who they planned to vote for in the
next election. Imagine we got the following
results:
Voting Republican: 13 | Voting Democratic: 16 | Non-Respondent: 21 | Total: 50
An Illustrative Example
• From the data, we estimate that the probability of voting Republican is 13/(13+16), or 44.8%, and a press release goes out saying that the Democrats are going to win by over 10 points; but, when the election comes around, it turns out they actually lose by 10 points.
• That certainly reflects poorly on the model. Where did this model go wrong?
An Illustrative Example
• Clearly, there are many issues with this trivial model.
• A list would include: it only sampled people from the phone book, and so only included people with listed numbers; it did not follow up with non-respondents, who might have different voting patterns from the respondents; and it has a very small sample size.
An Illustrative Example
• For instance, using a phone book to select participants in this survey is one source of bias.
• By only surveying certain classes of people, it skews the results in a way that would be consistent if we repeated the entire model-building exercise.
• Similarly, not following up with non-respondents is another source of bias, as it consistently changes the mixture of responses we get.
An Illustrative Example
• On the bulls-eye diagram, these sources of bias move the estimates away from the centre of the target, but they would not result in an increased scatter of estimates.
• On the other hand, the small sample size is a
source of variance. 
• If we increase our sample size, the results
would be more consistent each time we
repeated the survey and prediction.
An Illustrative Example
• The results still might be highly inaccurate due
to the large sources of bias, but the variance of
predictions will be reduced.
• On the bulls-eye diagram, the low sample size
results in a wide scatter of estimates. 
• Increasing the sample size would make the estimates clump closer together, but they still might miss the centre of the target.
An Illustrative Example
• In general the data set used to build the model is provided prior to model construction, and the modeller cannot simply say, “Let’s increase the sample size to reduce variance.”
• In practice an explicit tradeoff exists between
bias and variance where decreasing one
increases the other. 
• Minimizing the total error of the model requires
a careful balancing of these two forms of error.
The Bayes’ Estimator
• Sometimes, before looking at a sample, we (or experts of the application) may have some prior information on the possible value range that a parameter θ may take.
• This information is quite useful and should be used,
especially when the sample is small.
• The prior information does not tell us exactly what
the parameter value is, and we model this uncertainty
by viewing θ as a random variable and by defining a
prior density for it, p(θ).
The Bayes’ Estimator

• θ is the parameter to be estimated; the di are several estimates (denoted by ‘×’) over different samples Xi. Bias is the difference between the expected value of d and θ, and variance is how much the di are scattered around the expected value. We would like both to be small.
Example
• A traffic control engineer believes that the cars passing
through a particular intersection arrive at a mean
rate λ equal to either 3 or 5 for a given time interval. Prior
to collecting any data, the engineer believes that it is
much more likely that the rate λ = 3 than λ = 5. In fact,
the engineer believes that the prior probabilities are:
P(λ=3)=0.7    and      P(λ=5)=0.3
One day, during a randomly selected time interval, the engineer
observes x = 7 cars pass through the intersection. In light of the
engineer's observation, what is the probability that λ = 3? And
what is the probability that λ = 5?
Example
• Now, simply by using the definition of conditional probability, we know that the probability that λ = 3 given that X = 7 is:

P(λ=3|X=7) = P(λ=3, X=7) / P(X=7)

• which can be written using Bayes’ Theorem as:

P(λ=3|X=7) = P(λ=3) P(X=7|λ=3) / [P(λ=3) P(X=7|λ=3) + P(λ=5) P(X=7|λ=5)]
Example
• We can use the Poisson cumulative probability
table to find P(X = 7 | λ = 3) and P(X = 7 | λ =
5). They are:
P(X=7|λ=3) = 0.022 and P(X=7|λ=5) = 0.105
Example
• Now, we have everything we need to finalize
our calculation of the desired probability:
P(λ=3|X=7) = (0.7)(0.022) / [(0.7)(0.022) + (0.3)(0.105)]
= 0.0154 / (0.0154 + 0.0315)
= 0.328
Example
• The initial probability, in this case, P(λ = 3) =
0.7, is called the prior probability. That's
because it is the probability that the parameter
takes on a particular value prior to taking into
account any new information. The newly
calculated probability, that is:
• P(λ = 3 | X = 7) is called the posterior
probability.
Example
• That's because it is the probability that the
parameter takes on a particular value posterior
to, that is, after, taking into account the new
information. In this case, we have seen that the
probability that λ = 3 has decreased from 0.7
(the prior probability) to 0.328 (the posterior
probability) with the information obtained
from the observation x = 7.
Example
• A similar calculation can be made in
finding P(λ = 5 | X = 7). In doing so, we see:

P(λ=5|X=7) = (0.3)(0.105) / [(0.7)(0.022) + (0.3)(0.105)]
= 0.0315 / (0.0154 + 0.0315)
= 0.672
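• A brief sketch reproducing the engineer’s calculation (assuming SciPy is available; the exact Poisson pmf values differ slightly from the rounded table values above):

```python
from scipy.stats import poisson

priors = {3: 0.7, 5: 0.3}
x = 7  # observed number of cars

# Likelihoods P(X = 7 | lambda) from the Poisson pmf.
likelihood = {lam: poisson.pmf(x, lam) for lam in priors}

# Posterior P(lambda | X = 7) via Bayes' rule.
evidence = sum(priors[lam] * likelihood[lam] for lam in priors)
posterior = {lam: priors[lam] * likelihood[lam] / evidence for lam in priors}
print(posterior)  # approximately {3: 0.33, 5: 0.67}
```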
Example
• In this case, we see that the probability that λ = 5
has increased from 0.3 (the prior probability) to
0.672 (the posterior probability) with the
information obtained from the observation x = 7.
• That example is good for illustrating the distinction between prior probabilities and posterior probabilities, but it falls a bit short as a practical example in the real world, where a parameter θ can take on an infinite number of possible values.
Parametric Classification
• Models of data with a categorical response are
called classifiers.
• A classifier is built from training data, for which
classifications are known.
• The classifier assigns new test data to one of the
categorical levels of the response.
• Parametric methods, like discriminant analysis classification, fit a parametric model to the training data and interpolate to classify test data.
Parametric Classification
• Assume there is a car company selling K different cars, and we want to predict which car a customer will buy.
• To keep things simple, let us assume that the sole factor that affects a customer’s choice is his or her yearly income, which is denoted by x.
• Then P(Ci) is the proportion of customers who
buy car type i.
Parametric Classification
• P(x|Ci), the probability that a customer who
bought car type i has income x.
• It can be taken as N(μi, σi²), where μi is the mean income of such customers and σi² is their income variance.
• When we do not know P(Ci) and p(x|Ci), we
estimate them from a sample and plug in their
estimates to get the estimate for the discriminant
function.
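• A minimal sketch (my own illustration, with made-up sample data) of plugging the sample estimates of P(Ci), μi, and σi² into a discriminant gi(x) = log p(x|Ci) + log P(Ci):

```python
import numpy as np

# Made-up yearly incomes of past buyers of two car types.
incomes = {1: np.array([20., 25., 30., 28.]),  # buyers of car type 1
           2: np.array([50., 55., 60.])}       # buyers of car type 2

total = sum(len(v) for v in incomes.values())
params = {c: (len(x) / total,  # P(C_i): proportion who bought type i
              x.mean(),        # mu_i: mean income of those customers
              x.var())         # sigma_i^2: their income variance
          for c, x in incomes.items()}

def g(x, c):
    # Discriminant: log of Gaussian likelihood N(mu_i, sigma_i^2) plus log prior.
    p, mu, var = params[c]
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var) + np.log(p)

x_new = 40.0
print(max(params, key=lambda c: g(x_new, c)))  # class with the highest g_i(x)
```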
Model Selection Procedures
• There are a number of procedures we can use
to fine-tune model complexity.
• In practice, the method we use to find the
optimal complexity is cross validation.
• We cannot calculate bias and variance for a
model, but we can calculate the total error.
Model Selection Procedures
• Given a dataset, we divide it into two parts as
training and validation sets, train candidate
models of different complexities, and test their
error on the validation set left out during
training.
• As the model complexity increases, training
error keeps decreasing.
Model Selection Procedures
• The error on the validation set decreases up to a certain level of complexity, then stops decreasing or does not decrease significantly further, and may even increase if there is significant noise.
• Another approach that is used frequently is
regularization. In this approach, we write an
augmented error function
E = error on data + λ · model complexity
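• A compact sketch (illustrative, with synthetic data) of picking model complexity on a validation set, here the polynomial degree for a curve fit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)  # noisy synthetic data

# Split into training and validation sets.
x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

best_degree, best_err = None, np.inf
for degree in range(1, 10):  # candidate model complexities
    coeffs = np.polyfit(x_tr, y_tr, degree)                    # train
    val_err = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)  # validate
    if val_err < best_err:
        best_degree, best_err = degree, val_err

print(best_degree, best_err)  # the complexity with the lowest validation error
```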
Model Selection Procedures
• Methods such as Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC) work by estimating this optimism (the amount by which training error underestimates test error) and adding it to the training error to estimate test error, without any need for validation.
• Bayesian model selection is used when we have
some prior knowledge about the appropriate class of
approximating functions. This prior knowledge is
defined as a prior distribution over models,
p(model).
Model Selection Procedures
• Given the data and assuming a model, we can calculate
p(model|data) using Bayes’ rule:
p(model|data) = p(data|model) p(model) / p(data)

• p(model|data) is the posterior probability of the model given our prior subjective knowledge about models, namely p(model), and the objective support provided by the data, namely p(data|model). We can then choose the model with the highest posterior probability, or take an average over all models weighted by their posterior probabilities.
