MC4301 - ML Unit 3 (Bayesian Learning)


C. Abdul Hakeem College of Engineering & Technology


Department of Master of Computer Applications
MC4301 - Machine Learning
Unit 3
Bayesian Learning

Basic Probability Notation

Uncertainty:

In propositional logic we might write A→B, which means that if A is true then B is true. But consider a situation where we are not sure whether A is true or not: then we cannot express this statement. This situation is called uncertainty.

So to represent uncertain knowledge, where we are not sure about the predicates, we need uncertain reasoning or probabilistic reasoning.

Causes of uncertainty:

Following are some leading causes of uncertainty in the real world:

1. Information obtained from unreliable sources
2. Experimental errors
3. Equipment faults
4. Temperature variation
5. Climate change

Probabilistic reasoning:

Probabilistic reasoning is a way of knowledge representation where we apply the concept of probability to indicate the uncertainty in knowledge. In probabilistic reasoning, we combine probability theory with logic to handle the uncertainty.

We use probability in probabilistic reasoning because it provides a way to handle the uncertainty that is the result of someone's laziness and ignorance.

In the real world, there are lots of scenarios where the certainty of something is not confirmed, such as "It will rain today," "the behavior of someone in some situation," or "a match between two teams or two players." These are probable statements: we can assume they will happen, but we cannot be sure, so here we use probabilistic reasoning.

Need of probabilistic reasoning in AI:

 When there are unpredictable outcomes.
 When specifications or possibilities of predicates become too large to handle.
 When an unknown error occurs during an experiment.

In probabilistic reasoning, there are two ways to solve problems with uncertain
knowledge:


 Bayes' rule
 Bayesian Statistics

As probabilistic reasoning uses probability and related terms, let's first understand some common terms before going further:

Probability: Probability can be defined as a chance that an uncertain event will occur. It is the numerical measure of the likelihood that an event will occur. The value of a probability always lies between 0 and 1, which represent the ideal limits of uncertainty and certainty.

0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.

P(A) = 0 indicates total uncertainty in an event A.

P(A) = 1 indicates total certainty in an event A.

We can find the probability of an event not happening by using the formulas below:

 P(¬A) = probability of event A not happening.
 P(¬A) + P(A) = 1.

Event: Each possible outcome of a variable is called an event.

Sample space: The collection of all possible events is called sample space.

Random variables: Random variables are used to represent the events and objects in
the real world.

Prior probability: The prior probability of an event is the probability computed before observing new information.

Posterior probability: The probability that is calculated after all evidence or information has been taken into account. It is a combination of the prior probability and the new information.

Conditional probability:

Conditional probability is the probability of an event occurring given that another event has already happened.


Let's suppose we want to calculate the probability of event A when event B has already occurred, "the probability of A under the condition of B". It can be written as:

P(A | B) = P(A ⋀ B) / P(B)

Where P(A ⋀ B) = joint probability of A and B,

P(B) = marginal probability of B.

If the probability of A is given and we need to find the probability of B, then it will be given as:

P(B | A) = P(A ⋀ B) / P(A)

It can be explained using a Venn diagram: once event B has occurred, the sample space is reduced to the set B, and we can calculate event A given B by dividing P(A ⋀ B) by P(B).

Example:

In a class, 70% of the students like English and 40% of the students like both English and Mathematics. What percentage of the students who like English also like Mathematics?

Solution:

Let A be the event that a student likes Mathematics, and

B be the event that a student likes English.

P(A | B) = P(A ⋀ B) / P(B) = 0.40 / 0.70 = 0.57

Hence, 57% of the students who like English also like Mathematics.
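The same calculation in Python, just to make the arithmetic explicit (the numbers are the ones from the example above):

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_english = 0.70            # P(B): student likes English
p_english_and_math = 0.40   # P(A and B): student likes both subjects

p_math_given_english = p_english_and_math / p_english
print(round(p_math_given_english, 2))   # 0.57, i.e. about 57%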


Inference

Inference means to find a conclusion based on facts, information, and evidence. In simple words, when we draw a conclusion from facts and figures to reach a particular decision, that is called inference. In artificial intelligence, the expert system or any agent performs this task with the help of the inference engine. In the inference engine, the information and facts present in the knowledge base are considered according to the situation, and the engine draws a conclusion from these facts, based on which further processing and decision making take place in the agent.

The inference process in an agent takes place according to some rules, which are known as the inference rules or rules of inference. Following are the major types of inference rules that are used (a small code sketch of forward chaining follows the list):

1) Addition: This inference rule is stated as follows:

P
----------
∴P v Q

2) Simplification: This inference rule states that:

P^Q P^Q
---------- OR ----------
∴P ∴Q

3) Modus Ponens: This is the most widely used inference rule. It states:

P->Q
P
-----------
∴Q

4) Modus Tollens: This rule states that:

P->Q
~Q
-----------
∴~P

5) Forward Chaining: It is a type of deductive Inference rule. It states that:

P
P->Q
-----------
∴Q


6) Backward Chaining: This is also a type of deductive inference rule. This rule
states that:

Q
P->Q
-----------
∴P

7) Resolution: In the reasoning by resolution, we are given the goal condition and
available facts and statements. Using these facts and statements, we have to decide
whether the goal condition is true or not, i.e. is it possible for the agent to reach the
goal state or not. We prove this by the method of contradiction. This rule states that:

PvQ
~P v R
-----------
∴Q v R

8) Hypothetical Syllogism: This rule states the transitive relation between the
statements:

P->Q
Q->R
-----------
∴P->R

9) Disjunctive Syllogism: This rule is stated as follows:

PvQ
~P
-----------
∴Q
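To make the most commonly used of these rules concrete, here is a minimal Python sketch of forward chaining with Modus Ponens. The rule format and the facts are made up for illustration; this is only a toy, not a full inference engine.

# Minimal forward chaining with Modus Ponens (illustrative sketch).
# Rules are (premise, conclusion) pairs representing "premise -> conclusion".

def forward_chain(facts, rules):
    """Repeatedly apply Modus Ponens until no new facts can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)   # P and P->Q, therefore Q
                changed = True
    return known

# Hypothetical knowledge base: rain -> wet_ground, wet_ground -> slippery
rules = [("rain", "wet_ground"), ("wet_ground", "slippery")]
print(sorted(forward_chain({"rain"}, rules)))   # ['rain', 'slippery', 'wet_ground']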

Machine learning (ML) inference is the process of running live data points into a
machine learning algorithm (or “ML model”) to calculate an output such as a single
numerical score. This process is also referred to as “operationalizing an ML model” or
“putting an ML model into production.” When an ML model is running in production,
it is often then described as artificial intelligence (AI) since it is performing functions
similar to human thinking and analysis. Machine learning inference basically entails
deploying a software application into a production environment, as the ML model is
typically just software code that implements a mathematical algorithm. That
algorithm makes calculations based on the characteristics of the data, known as
“features” in the ML vernacular.

An ML lifecycle can be broken up into two main, distinct parts. The first is the
training phase, in which an ML model is created or “trained” by running a specified
subset of data into the model. ML inference is the second phase, in which the model is
put into action on live data to produce actionable output. The data processing by the


ML model is often referred to as “scoring,” so one can say that the ML model scores
the data, and the output is a score.

ML inference is generally deployed by DevOps engineers or data engineers. Sometimes the data scientists, who are responsible for training the models, are asked
to own the ML inference process. This latter situation often causes significant
obstacles in getting to the ML inference stage, since data scientists are not necessarily
skilled at deploying systems. Successful ML deployments often are the result of tight
coordination between different teams, and newer software technologies are also often
deployed to try to simplify the process. An emerging discipline known as “MLOps” is
starting to put more structure and resources around getting ML models into
production and maintaining those models when changes are needed.

How Does Machine Learning Inference Work?

To deploy a machine learning inference environment, you need three main components in addition to the model:

1. One or more data sources
2. A system to host the ML model
3. One or more data destinations

In machine learning inference, the data sources are typically a system that captures the
live data from the mechanism that generates the data. The host system for the machine
learning model accepts data from the data sources and inputs the data into the
machine learning model. The data destinations are where the host system should
deliver the output score from the machine learning model.

The data sources are typically a system that captures the live data from the mechanism
that generates the data. For example, a data source might be an Apache Kafka cluster
that stores data created by an Internet of Things (IoT) device, a web application log
file, or a point-of-sale (POS) machine. Or a data source might simply be a web


application that collects user clicks and sends data to the system that hosts the ML
model.

The host system for the ML model accepts data from the data sources and inputs the
data into the ML model. It is the host system that provides the infrastructure to turn
the code in the ML model into a fully operational application. After an output is
generated from the ML model, the host system then sends that output to the data
destinations. The host system can be, for example, a web application that accepts data
input via a REST interface, or a stream processing application that takes an incoming
feed of data from Apache Kafka to process many data points per second.

The data destinations are where the host system should deliver the output score from
the ML model. A destination can be any type of data repository like Apache Kafka or
a database, and from there, downstream applications take further action on the scores.
For example, if the ML model calculates a fraud score on purchase data, then the
applications associated with the data destinations might send an “approve” or “decline”
message back to the purchase site.
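As a purely illustrative sketch of such a host system, the snippet below shows a tiny REST endpoint in Python using Flask: it accepts purchase features as JSON, scores them, and returns a fraud score together with an approve/decline decision. The stub model, the meaning of the features, and the 0.5 threshold are assumptions for this sketch only; a real deployment would load an actual trained model here.

from flask import Flask, request, jsonify

app = Flask(__name__)

class StubFraudModel:
    """Stand-in for a real trained model (assumption for this sketch)."""
    def score(self, features):
        amount = features[0]                  # pretend the first feature is the purchase amount
        return min(amount / 1000.0, 1.0)      # larger purchases look riskier (made-up rule)

model = StubFraudModel()                      # in production: load the real trained model

@app.route("/score", methods=["POST"])
def score_purchase():
    payload = request.get_json()              # e.g. {"features": [300.0, 1, 12]}
    fraud_score = model.score(payload["features"])
    decision = "decline" if fraud_score > 0.5 else "approve"   # assumed threshold
    return jsonify({"fraud_score": fraud_score, "decision": decision})

if __name__ == "__main__":
    app.run(port=8080)                        # downstream data destinations consume the JSON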

Challenges of Machine Learning Inference

As mentioned earlier, the work in ML inference can sometimes be misallocated to the data scientist. If given only a low-level set of tools for ML inference, the data scientist
may not be successful in the deployment.

Additionally, DevOps and data engineers are sometimes not able to help with
deployment, often due to conflicting priorities or a lack of understanding of what’s
required for ML inference. In many cases, the ML model is written in a language like
Python, which is popular among data scientists, but the IT team is more well-versed in
a language like Java. This means that engineers must take the Python code and
translate it to Java to run it within their infrastructure. In addition, the deployment of
ML models requires some extra coding to map the input data into a format that the
ML model can accept, and this extra work adds to the engineers’ burden when
deploying the ML model.

Also, the ML lifecycle typically requires experimentation and periodic updates to the
ML models. If deploying the ML model is difficult in the first place, then updating
models will be almost as difficult. The whole maintenance effort can be difficult, as
there are business continuity and security issues to address.

Another challenge is attaining suitable performance for the workload. REST-based systems that perform the ML inference often suffer from low throughput and high
latency. This might be suitable for some environments, but modern deployments that
deal with IoT and online transactions are facing huge loads that can overwhelm these
simple REST-based deployments. And the system needs to be able to scale to not only
handle growing workloads but to also handle temporary load spikes while retaining
consistent responsiveness.


Independence

1. The intuition of Conditional Independence

Let’s say A is the height of a child and B is the number of words that the child
knows. It seems when A is high, B is high too.

There is a single piece of information that will make A and B completely independent. What would that be?

The child’s age.

The height and the # of words known by the kid are NOT independent, but they are
conditionally independent if you provide the kid’s age.

2. Mathematical Form

A: The height of a child
B: The # of words that the child knows
C: The child's age

"A and B are conditionally independent given C" is written (A ⊥ B) | C and means:

P(A, B | C) = P(A | C) * P(B | C)


A better way to remember the expression:

Conditional independence is basically the concept of independence, P(A ∩ B) = P(A) * P(B), applied to the conditional model.

Why is P(A|B ∩ C) = P(A|C) when (A ⊥ B) | C?

Here goes the proof.
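In outline, the derivation only uses the definition of conditional probability and the assumption P(A, B | C) = P(A | C) * P(B | C):

P(A | B ∩ C) = P(A ∩ B ∩ C) / P(B ∩ C)
             = P(A, B | C) * P(C) / (P(B | C) * P(C))
             = P(A | C) * P(B | C) / P(B | C)
             = P(A | C)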


The gist of conditional independence: Knowing C makes A and B independent.

P(A,B|C) = P(A|C) * P(B|C)

3. Applications

Why does the conditional independence even matter?

Because it is a foundation for many statistical models that we use. (e.g., latent class
models, factor analysis, graphical models, etc.)

A. Conditional Independence in Bayesian Network (aka Graphical Models)

A Bayesian network represents a joint distribution using a graph. Specifically, it is a directed acyclic graph in which each edge is a conditional dependency, and each
node is a distinctive random variable. It has many other names: belief network,
decision network, causal network, Bayes(ian) model or probabilistic directed acyclic
graphical model, etc.

The usual picture here is a small directed acyclic graph over variables such as Rain, Sprinkler, and Grass Wet, the variables used in the example below.

In order for the Bayesian network to model a probability distribution, it relies on the important assumption: each variable is conditionally independent of its non-descendants, given its parents.

For instance, we can simplify P(Grass Wet | Sprinkler, Rain) into P(Grass Wet | Sprinkler), since Grass Wet is conditionally independent of its non-descendant, Rain, given Sprinkler.

Using this property, we can simplify the whole joint distribution into the factored formula below:

P(X1, X2, ..., Xn) = ∏i P(Xi | Parents(Xi))
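As a small, hypothetical sketch of how this factorization is used, assume (for illustration only) a chain-structured network Rain -> Sprinkler -> Grass Wet, which matches the conditional independence stated above, with made-up conditional probability tables. The joint probability of any assignment is then just a product of local terms:

# Hypothetical CPTs for a chain Rain -> Sprinkler -> Grass Wet (all numbers made up).
p_rain = {True: 0.2, False: 0.8}
p_sprinkler_given_rain = {True: {True: 0.01, False: 0.99},   # P(Sprinkler | Rain)
                          False: {True: 0.40, False: 0.60}}
p_wet_given_sprinkler = {True: {True: 0.90, False: 0.10},    # P(Grass Wet | Sprinkler)
                         False: {True: 0.05, False: 0.95}}

def joint(rain, sprinkler, wet):
    """P(Rain, Sprinkler, Grass Wet) = P(Rain) * P(Sprinkler | Rain) * P(Grass Wet | Sprinkler)."""
    return (p_rain[rain]
            * p_sprinkler_given_rain[rain][sprinkler]
            * p_wet_given_sprinkler[sprinkler][wet])

print(joint(rain=False, sprinkler=True, wet=True))   # 0.8 * 0.40 * 0.90 = 0.288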

What is so great about this approximation (the conditional independence assumption)?

Conditional independence between variables can greatly reduce the number of parameters.

This cuts down the computation considerably, since for each variable we now only take into account its parents and disregard everything else.

Let’s take a look at the numbers.

Let’s say you have n binary variables (= n nodes).

The unconstrained joint distribution requires O(2^n) probabilities.

For a Bayesian Network, with a maximum of k parents for any node, we need
only O(n * 2^k) probabilities. (This can be carried out in linear time for certain
numbers of classes.)

For example, with n = 30 binary variables and k = 4 maximum parents per node:

 Unconstrained joint distribution: needs 2^30 (about 1 billion) probabilities -> intractable!
 Bayesian network: needs only 30 * 2^4 = 480 probabilities
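A two-line arithmetic check of those counts:

n, k = 30, 4
print(2 ** n)       # 1073741824 -> roughly a billion entries for the full joint
print(n * 2 ** k)   # 480 parameters for the Bayesian network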

We can have an efficient factored representation for a joint distribution using conditional independence.

B. Conditional Independence in Bayesian Inference

Let’s say I’d like to estimate the engagement (clap) rate of my blog. Let p be the
proportion of readers who will clap for my articles. We’ll choose n readers randomly
from the population. For i = 1, …, n, let Xi = 1 if the reader claps or Xi = 0 if s/he
doesn’t.

In a frequentist approach, we don't assign a probability distribution to p. The estimate of p would simply be 'sum(Xi) / n', and we would treat X1, …, Xn as independent random variables.


On the other hand, in Bayesian inference, we assume p follows a distribution, not just a constant. In this model, the random variables X1, …, Xn are NOT independent, but they are conditionally independent given the distribution of p.
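A minimal sketch of this Bayesian view, assuming a Beta(1, 1) (uniform) prior on p and some made-up clap data. Given p, the Xi are treated as independent Bernoulli draws, and the Beta prior makes the posterior over p a Beta distribution again:

# Beta-Binomial update: prior Beta(a, b), data = Bernoulli(p) observations.
a, b = 1.0, 1.0                       # uniform prior over the clap rate p (assumption)
claps = [1, 0, 0, 1, 0, 0, 0, 1]      # made-up reader data: 1 = clapped, 0 = didn't

a_post = a + sum(claps)               # add the number of successes
b_post = b + len(claps) - sum(claps)  # add the number of failures

posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)                 # 0.4: updated belief about the clap rate p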

C. Correlation ≠ Causation

“Correlation is not causation” means that just because two things correlate does not
necessarily mean that one causes the other.

Here is a hilarious example of taxi accidents:

A study has shown a positive and significant correlation between the number of
accidents and taxi drivers’ wearing coats. They found that coats might hinder the
driver’s movements and cause accidents. A new law was ready to ban taxi drivers
from wearing coats while driving.

Until another study pointed out that people wear coats when it rains…

P(accidents, coats | rain) = P(accidents | rain) * P(coats | rain)

Correlations between two things can be caused by a third factor that affects both of
them. This third factor is called a confounder. The confounder, which is rain, was
responsible for the correlation between accident and wearing coats.

P(accidents | coats, rain) = P(accidents | rain)

Note that this does NOT mean accidents are independent of coats overall. What it means is: given that it is raining, knowing whether drivers wear coats doesn't give any more information about accidents.
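A small, self-contained simulation of this story, with all probabilities invented for illustration: rain drives both coat-wearing and accidents, so the two are correlated overall, but within the "raining" and "not raining" groups the correlation roughly disappears.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

rain = rng.random(n) < 0.3                              # P(rain) = 0.3 (made up)
coats = rng.random(n) < np.where(rain, 0.9, 0.2)        # coats are more likely when it rains
accidents = rng.random(n) < np.where(rain, 0.2, 0.05)   # accidents are more likely when it rains

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print(corr(coats, accidents))                  # clearly positive overall (confounded by rain)
print(corr(coats[rain], accidents[rain]))      # close to 0 once we condition on rain
print(corr(coats[~rain], accidents[~rain]))    # close to 0 here as well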

4. Conditional Independence vs Marginal Independence

Marginal independence is just the same as plain independence.

Sometimes, two random variables might not be marginally independent. However, they can become independent after we observe some third variable.

5. More examples!

 Amount of speeding fine ⊥ Type of car | Speed
 Lung cancer ⊥ Yellow teeth | Smoking
 Child's genes ⊥ Grandparents' genes | Parents' genes
 Is the car's starter motor working? ⊥ Is the car's radio working? | Battery
 Future ⊥ Past | Present (This is the Markov assumption!)

They are all in the same form as A ⊥ B | C.


A and B look related if we don’t take C into account. However, once we include C in
the picture, then the apparent relationship between A and B disappears. As you see,
any causal relationship is potentially conditionally independent. We will never know


for sure about the relationship between A & B until we test every possible C
(confounding variable)!

Bayes’ Rule

Bayes' Rule is the most important rule in data science. It is the mathematical rule that
describes how to update a belief, given some evidence. In other words – it describes
the act of learning.

The equation itself is not too complex:

The equation: Posterior = Prior x (Likelihood over Marginal probability)

There are four parts:

 Posterior probability (updated probability after the evidence is considered)
 Prior probability (the probability before the evidence is considered)
 Likelihood (probability of the evidence, given the belief is true)
 Marginal probability (probability of the evidence, under any circumstance)
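The rule itself is a one-liner; the little Python function below just names the four parts (the function name and the example numbers are ours, not from the source):

def bayes_posterior(prior, likelihood, marginal):
    """Posterior = Prior * Likelihood / Marginal probability."""
    return prior * likelihood / marginal

print(bayes_posterior(prior=0.02, likelihood=0.90, marginal=0.03))   # roughly 0.6, with made-up inputs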

Bayes' Rule can answer a variety of probability questions, which help us (and
machines) understand the complex world we live in.

It is named after Thomas Bayes, an 18th century English theologian and mathematician. Bayes originally wrote about the concept, but it did not receive much attention during his lifetime.

French mathematician Pierre-Simon Laplace independently published the rule in his 1814 work Essai philosophique sur les probabilités.

Today, Bayes' Rule has numerous applications, from statistical analysis to machine
learning.

Conditional probability

Conditional probability is the bridge that lets you talk about how multiple uncertain
events are related. It lets you talk about how the probability of an event can vary
under different conditions.


For example, consider the probability of winning a race, given the condition you
didn't sleep the night before. You might expect this probability to be lower than the
probability you'd win if you'd had a full night's sleep.

Or, consider the probability that a suspect committed a crime, given that their
fingerprints are found at the scene. You'd expect the probability they are guilty to be
greater, compared with had their fingerprints not been found.

The notation for conditional probability is usually:

P(A|B)

Which is read as "the probability of event A occurring, given event B occurs".

An important thing to remember is that conditional probabilities are not the same as
their inverses.

That is, the "probability of event A given event B" is not the same thing as the
"probability of event B, given event A".

To remember this, take the following example:

The probability of clouds, given it is raining (100%), is not the same as the probability it is raining, given there are clouds.

Bayes' Rule in detail

Bayes' Rule tells you how to calculate a conditional probability with information you
already have.

It is helpful to think in terms of two events – a hypothesis (which can be true or false)
and evidence (which can be present or absent).


However, it can be applied to any type of events, with any number of discrete or
continuous outcomes.

Bayes' Rule lets you calculate the posterior (or "updated") probability. This is a
conditional probability. It is the probability of the hypothesis being true, if the
evidence is present.

Think of the prior (or "previous") probability as your belief in the hypothesis
before seeing the new evidence. If you had a strong belief in the hypothesis already,
the prior probability will be large.

The prior is multiplied by a fraction. Think of this as the "strength" of the evidence.
The posterior probability is greater when the top part (numerator) is big, and the
bottom part (denominator) is small.

The numerator is the likelihood. This is another conditional probability. It is the probability of the evidence being present, given the hypothesis is true.

This is not the same as the posterior!

Remember, the "probability of the evidence being present given the hypothesis is
true" is not the same as the "probability of the hypothesis being true given the
evidence is present".

Now look at the denominator. This is the marginal probability of the evidence. That
is, it is the probability of the evidence being present, whether the hypothesis is true or
false. The smaller the denominator, the more "convincing" the evidence.

Worked example of Bayes' Rule

Here's a simple worked example.

Your neighbour is watching their favourite football (or soccer) team. You hear them
cheering, and want to estimate the probability their team has scored.

Step 1 – write down the posterior probability of a goal, given cheering

Step 2 – estimate the prior probability of a goal as 2%

Step 3 – estimate the likelihood probability of cheering, given there's a goal as 90%
(perhaps your neighbour won't celebrate if their team is losing badly)

Step 4 – estimate the marginal probability of cheering – this could be because their team scored, or because of something else entirely (a near miss, a goal in another match, and so on):
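To finish the example numerically, assume (purely for illustration) that your neighbour also cheers about 1% of the time when there is no goal. The marginal probability of cheering then follows from the law of total probability, and Bayes' Rule gives the posterior:

p_goal = 0.02                    # prior: P(goal) from Step 2
p_cheer_given_goal = 0.90        # likelihood: P(cheering | goal) from Step 3
p_cheer_given_no_goal = 0.01     # assumed for illustration: P(cheering | no goal)

# Step 4: marginal probability of cheering (law of total probability)
p_cheer = p_cheer_given_goal * p_goal + p_cheer_given_no_goal * (1 - p_goal)

# Posterior = prior * likelihood / marginal
p_goal_given_cheer = p_goal * p_cheer_given_goal / p_cheer
print(round(p_goal_given_cheer, 2))   # about 0.65: hearing a cheer makes a goal much more likely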

