Bayes Theory
Introduction
• Data comes from a process that is not
completely known. This lack of knowledge is
indicated by modeling the process as a random
process.
• Maybe the process is actually deterministic,
but because we do not have access to complete
knowledge about it, we model it as random
and use probability theory to analyze it.
Introduction
• Tossing a coin is a random process because we cannot
predict at any toss whether the outcome will be heads or
tails; that is why we toss coins, or buy lottery tickets.
• We can only talk about the probability that the outcome of
the next toss will be heads or tails.
• It may be argued that if we have access to extra
knowledge such as the exact composition of the coin, its
initial position, the force and its direction that is applied to
the coin when tossing it, where and how it is caught, and
so forth, the exact outcome of the toss can be predicted.
Introduction
• The extra pieces of knowledge that we do not have
access to are named the unobservable variables.
• In the coin tossing example, the only observable
variable is the outcome of the toss.
• Denoting the unobservables by z and the observable as
x, in reality we have
x = f (z)
• where f (·) is the deterministic function that defines the
outcome from the unobservable pieces of knowledge.
Introduction
• Because we cannot model the process this way,
we define the outcome X as a random variable
drawn from a probability distribution P(X = x)
that specifies the process.
• The outcome of tossing a coin is heads or tails,
and we define a random variable that takes one
of two values. Let us say X = 1 denotes that the
outcome of a toss is heads and X = 0 denotes
tails.
Introduction
• Such an X is Bernoulli distributed, where the
parameter of the distribution, po, is the
probability that the outcome is heads:
P(X = 1) = po and
P(X = 0) = 1 − P(X = 1) = 1 − po
• Assume that we are asked to predict the outcome
of the next toss. If we know po, our prediction
will be heads if po > 0.5 and tails otherwise.
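• This decision rule can be sketched in a few lines of Python; the helper name is hypothetical:

```python
# Minimal sketch of the Bernoulli prediction rule (hypothetical helper name).
def predict_next_toss(p0):
    """Predict 'heads' when p0 > 0.5, otherwise 'tails'."""
    return "heads" if p0 > 0.5 else "tails"

print(predict_next_toss(0.7))  # heads
print(predict_next_toss(0.3))  # tails
```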
Introduction
• If we do not know P(X) and want to estimate
this from a given sample, then we are in the
realm of statistics.
• We have a sample, X, containing examples
drawn from the probability distribution of the
observables xt , denoted as p(x).
• The aim is to build an approximator to it,
ˆp(x), using the sample X.
Introduction
• In the coin tossing example, the sample
contains the outcomes of the past N tosses.
Then using X, we can estimate po, which is
the parameter that uniquely specifies the
distribution. Our estimate of po is
ˆpo = #{tosses with outcome heads} / #{tosses}
Introduction
• Numerically, using the random variables, xt is
1 if the outcome of toss t is heads and 0
otherwise. Given the sample {heads, heads,
heads, tails, heads, tails, tails, heads, heads},
we have X = {1, 1, 1, 0, 1, 0, 0, 1, 1} and the
estimate is ˆpo = 6/9 ≈ 0.67.
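• The estimate can be checked directly from the sample on this slide:

```python
# 1 = heads, 0 = tails; the sample of nine tosses from above.
X = [1, 1, 1, 0, 1, 0, 0, 1, 1]
p0_hat = sum(X) / len(X)   # fraction of heads
print(p0_hat)              # 6/9 ≈ 0.667
```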
Classification
• Take the example of credit scoring in a bank:
according to their past transactions, some
customers are low-risk, in that they paid back
their loans and the bank profited from them, and
other customers are high-risk, in that they
defaulted.
• Analyzing this data, we would like to learn the
class “high-risk customer” so that in the future,
when there is a new application for a loan, we
can predict the applicant’s risk.
Classification
• Using our knowledge of the application, let us
say that we decide that there are two pieces of
information that are observable.
• We observe them because we have reason to
believe that they give us an idea about the
credibility of a customer.
• Let us say, for example, we observe customer’s
yearly income and savings, which we represent
by two random variables X1 and X2.
Classification
• It may again be claimed that if we had access to other
pieces of knowledge such as the state of economy in full
detail and full knowledge about the customer, his or her
intention, moral codes, and so forth, whether someone is a
low-risk or high-risk customer could have been
deterministically calculated.
• But these are unobservables, and with what we can
observe, the credibility of a customer is denoted by a
Bernoulli random variable C conditioned on the
observables X = [X1,X2]T where C = 1 indicates a high-risk
customer and C = 0 indicates a low-risk customer.
Classification
• Thus if we know P(C|X1,X2), when a new
application arrives with X1 = x1 and X2 = x2, we
can decide whether the customer is high-risk or
low-risk.
• The problem then is to be able to calculate
P(C|x). Using Bayes’ rule, it can be written as
P(C|x) = P(C) p(x|C) / p(x)
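• As a sketch, the posterior follows from Bayes’ rule once the prior and the two class likelihoods are available; all numbers below are made up for illustration:

```python
def posterior_high_risk(prior_high, lik_high, lik_low):
    """P(C=1|x) = P(C=1) p(x|C=1) / [P(C=1) p(x|C=1) + P(C=0) p(x|C=0)]."""
    evidence = prior_high * lik_high + (1 - prior_high) * lik_low
    return prior_high * lik_high / evidence

# Hypothetical numbers: 20% of applicants are high-risk, and this applicant's
# (income, savings) is 5x more likely under the high-risk class.
p = posterior_high_risk(prior_high=0.2, lik_high=0.5, lik_low=0.1)
print(round(p, 3))  # 0.556
```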
Simple Example
• A man is known to speak the truth 3 out of 4 times.
He throws a die and says, “it is a six.” What is the
probability that it is actually a six?
Simple Example
• The probability that the man speaks the truth is 3/4;
assume that when the die is not a six and he lies, he
reports a six.
P(six | says six) = (1/6 · 3/4) / (1/6 · 3/4 + 5/6 · 1/4)
= 3/8
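• The calculation can be verified with exact rational arithmetic (using the textbook assumption that a lie on a non-six is reported as a six):

```python
from fractions import Fraction

p_six = Fraction(1, 6)    # prior: the die actually shows six
p_truth = Fraction(3, 4)  # probability the man reports truthfully
numer = p_six * p_truth                        # six, reported truthfully
denom = numer + (1 - p_six) * (1 - p_truth)    # plus: not six, falsely reported as six
print(numer / denom)  # 3/8
```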
Classification
• Combining the prior and what the data tells us using
Bayes’ rule, we calculate the posterior probability of the
concept, P(C|x), after having seen the observation, x.
Sports: P(Yes) = 4/6, P(No) = 2/6
SUV: P(Yes) = 1/4, P(No) = 3/4
Example
• Suppose X is Poisson with mean λ, where λ is known to
be either 3 or 5, with prior probabilities P(λ = 3) = 0.7 and
P(λ = 5) = 0.3, and we observe X = 7. By Bayes’ rule,
P(λ=3|X=7) = P(λ=3) P(X=7|λ=3) / [P(λ=3) P(X=7|λ=3) + P(λ=5) P(X=7|λ=5)]
Example
• We can use the Poisson cumulative probability
table to find P(X = 7 | λ = 3) and P(X = 7 | λ =
5). They are:
P(X=7|λ=3)=0.022
and P(X=7|λ=5)=0.104
Example
• Now, we have everything we need to finalize
our calculation of the desired probability:
P(λ=3|X=7) = (0.7)(0.022) / [(0.7)(0.022) + (0.3)(0.104)]
= 0.0154 / (0.0154 + 0.0312)
≈ 0.330
Example
• The initial probability, in this case, P(λ = 3) =
0.7, is called the prior probability. That's
because it is the probability that the parameter
takes on a particular value prior to taking into
account any new information. The newly
calculated probability, that is:
• P(λ = 3 | X = 7) is called the posterior
probability.
Example
• That's because it is the probability that the
parameter takes on a particular value posterior
to, that is, after, taking into account the new
information. In this case, we have seen that the
probability that λ = 3 has decreased from 0.7
(the prior probability) to about 0.33 (the posterior
probability) with the information obtained
from the observation x = 7.
Example
• A similar calculation can be made in
finding P(λ = 5 | X = 7). In doing so, we see:
P(λ=5|X=7) = (0.3)(0.104) / [(0.7)(0.022) + (0.3)(0.104)]
= 0.0312 / (0.0154 + 0.0312)
≈ 0.670
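• Both posteriors can be reproduced from the Poisson pmf directly; because the tabulated values 0.022 and 0.104 are rounded, the exact computation gives ≈ 0.326 and ≈ 0.674, slightly different from the figures above:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

priors = {3: 0.7, 5: 0.3}
liks = {lam: poisson_pmf(7, lam) for lam in priors}        # ≈ 0.0216, ≈ 0.1044
evidence = sum(priors[l] * liks[l] for l in priors)
post = {l: priors[l] * liks[l] / evidence for l in priors}
print(round(post[3], 3), round(post[5], 3))  # 0.326 0.674
```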
Example
• In this case, we see that the probability that λ = 5
has increased from 0.3 (the prior probability) to
about 0.67 (the posterior probability) with the
information obtained from the observation x = 7.
• That example is good for illustrating the distinction
between prior probabilities and posterior
probabilities, but it falls a bit short as a practical
real-world example, where the parameter θ typically
takes on an infinite number of possible values.
Parametric Classification
• Models of data with a categorical response are
called classifiers.
• A classifier is built from training data, for which
classifications are known.
• The classifier assigns new test data to one of the
categorical levels of the response.
• Parametric methods, such as discriminant analysis
classification, fit a parametric model to the
training data and interpolate to classify test data.
Parametric Classification
• Assume there is a car company selling K
different types of cars, and we would like to
estimate the probability that a customer buys
each type.
• For simplicity, let us assume that the sole
factor affecting a customer’s choice is his or
her yearly income, which is denoted by x.
• Then P(Ci) is the proportion of customers who
buy car type i.
Parametric Classification
• p(x|Ci) is the probability that a customer who
bought car type i has income x.
• It can be taken as N(μi, σi²), where μi is the mean
income of such customers and σi² is their income
variance.
• When we do not know P(Ci) and p(x|Ci), we
estimate them from a sample and plug in their
estimates to get the estimate for the discriminant
function.
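• A minimal sketch of this plug-in approach, with hypothetical income data; the discriminant is gi(x) = log p(x|Ci) + log P(Ci):

```python
import math

def fit_class(xs):
    """Maximum-likelihood mean and variance for one class."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def log_gaussian(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

incomes = {  # hypothetical yearly incomes (in $1000s) per car type
    "economy": [20, 25, 22, 28],
    "luxury":  [70, 80, 95, 75],
}
n = sum(len(v) for v in incomes.values())
params = {c: fit_class(v) for c, v in incomes.items()}  # estimates of mu_i, var_i
priors = {c: len(v) / n for c, v in incomes.items()}    # estimates of P(Ci)

def predict(x):
    """Pick the class maximizing the discriminant g_i(x)."""
    return max(params, key=lambda c: log_gaussian(x, *params[c]) + math.log(priors[c]))

print(predict(24), predict(85))  # economy luxury
```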
Model Selection Procedures
• There are a number of procedures we can use
to fine-tune model complexity.
• In practice, the method we use to find the
optimal complexity is cross validation.
• We cannot calculate bias and variance for a
model, but we can calculate the total error.
Model Selection Procedures
• Given a dataset, we divide it into two parts as
training and validation sets, train candidate
models of different complexities, and test their
error on the validation set left out during
training.
• As the model complexity increases, training
error keeps decreasing.
Model Selection Procedures
• The error on the validation set decreases up to
a certain level of complexity, then stops
decreasing or does not decrease further
significantly, or even increases if there is
significant noise.
• Another approach that is used frequently is
regularization. In this approach, we write an
augmented error function
E = error on data + λ ·model complexity
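• A toy illustration of choosing among candidate models with this augmented error; the error values, complexities, and λ are all made up:

```python
# (validation error, complexity) pairs for hypothetical candidate models.
candidates = [(0.90, 1), (0.40, 3), (0.35, 8), (0.34, 20)]
lam = 0.1  # hypothetical trade-off weight

def augmented_error(err, complexity):
    """E = error on data + lambda * model complexity."""
    return err + lam * complexity

best = min(candidates, key=lambda ec: augmented_error(*ec))
print(best)  # (0.4, 3): extra complexity past this model buys little error reduction
```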
Model Selection Procedures
• Methods such as Akaike’s information criterion
(AIC) and the Bayesian information criterion (BIC)
work by estimating the optimism of the training error
(the amount by which it underestimates the test error)
and adding it to the training error to estimate test
error, without any need for validation.
• Bayesian model selection is used when we have
some prior knowledge about the appropriate class of
approximating functions. This prior knowledge is
defined as a prior distribution over models,
p(model).
Model Selection Procedures
• Given the data and assuming a model, we can calculate
p(model|data) using Bayes’ rule:
p(model|data) = p(data|model) p(model) / p(data)
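• A small numeric sketch of this rule with two hypothetical models; the priors and marginal likelihoods are made up for illustration:

```python
# Hypothetical prior over two models and their marginal likelihoods p(data|model).
p_model = {"simple": 0.6, "complex": 0.4}
p_data_given = {"simple": 0.02, "complex": 0.05}

p_data = sum(p_model[m] * p_data_given[m] for m in p_model)
posterior = {m: p_model[m] * p_data_given[m] / p_data for m in p_model}
print(posterior)  # ≈ {'simple': 0.375, 'complex': 0.625}
```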