Unit-3 Ai
Notes
Probability
Using Uncertain Knowledge: Agents do not have complete knowledge about the world, yet they must make decisions despite their uncertainty. It is not enough to assume what the world is like. Example: deciding whether to wear a seat belt. An agent needs to reason about its uncertainty.
Why Probability?
There is lots of uncertainty about the world, but agents still need to act. Predictions are needed to decide what to do:
o definitive predictions: you will be run over tomorrow
o point probabilities: the probability that you will be run over tomorrow is 0.002
o probability ranges: you will be run over with probability in the range [0.001, 0.34]
Acting is gambling: agents who don't use probabilities will lose to those who do (Dutch books). Probabilities can be learned from data. Bayes' rule specifies how to combine data and prior knowledge. Probability is an agent's measure of belief in some proposition (subjective probability). An agent's belief depends on its prior assumptions and on what the agent observes.
Belief in a proposition f can be measured as a number between 0 and 1: this is the probability of f.
o The probability of f is 0 means that f is believed to be definitely false.
o The probability of f is 1 means that f is believed to be definitely true.
Using 0 and 1 is purely a convention. A probability of f strictly between 0 and 1 means the agent is ignorant of its truth value. Probability is a measure of an agent's ignorance, not a measure of degree of truth.
Random Variables
A random variable is a term in a language that can take one of a number of different values.
The range of a variable X, written range(X), is the set of values X can take. A tuple of random variables ⟨X1, ..., Xn⟩ is a complex random variable with range range(X1) × ... × range(Xn). Often the tuple is written as X1, ..., Xn. The assignment X = x means variable X has value x. A proposition is a Boolean formula made from assignments of values to variables.
A possible world specifies an assignment of one value to each random variable. A random variable is a function from possible worlds into the range of the random variable. ω ⊨ X = x means variable X is assigned value x in world ω. Logical connectives have their standard meaning:
o ω ⊨ α ∧ β if ω ⊨ α and ω ⊨ β
o ω ⊨ α ∨ β if ω ⊨ α or ω ⊨ β
o ω ⊨ ¬α if ω ⊭ α
Let Ω be the set of all possible worlds.
Semantics of Probability
For a finite number of possible worlds: define a nonnegative measure µ(ω) for each world ω so that the measures of the possible worlds sum to 1. The probability of a proposition f is defined by:
P(f) = Σ_{ω ⊨ f} µ(ω)
Axiom 1: 0 ≤ P(a) for any proposition a.
Axiom 2: P(true) = 1.
Axiom 3: P(a ∨ b) = P(a) + P(b) if a and b cannot both be true.
These axioms are sound and complete with respect to the semantics.
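A minimal sketch of this semantics (the two variables and their measures below are made up for illustration): the probability of a proposition is the sum of the measures of the worlds that satisfy it.

```python
# Possible-worlds semantics: each world gets a nonnegative measure,
# and the measures sum to 1. Variable names/values are illustrative.
worlds = {
    ("rain", "sprinkler"): 0.1,
    ("rain", "no_sprinkler"): 0.3,
    ("no_rain", "sprinkler"): 0.2,
    ("no_rain", "no_sprinkler"): 0.4,
}
assert abs(sum(worlds.values()) - 1.0) < 1e-9  # measures sum to 1

def prob(f):
    """P(f) = sum of mu(w) over worlds w in which proposition f holds."""
    return sum(mu for w, mu in worlds.items() if f(w))

p_rain = prob(lambda w: w[0] == "rain")
p_rain_or_sprinkler = prob(lambda w: w[0] == "rain" or w[1] == "sprinkler")
print(p_rain, p_rain_or_sprinkler)
```

Note that `prob` applied to a tautology returns 1, matching Axiom 2, and disjoint propositions add, matching Axiom 3.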
Uncertainty:
So far we have represented knowledge using first-order logic and propositional logic with certainty, which means we were sure about the predicates. With this kind of representation we might write A→B, which means if A is true then B is true. But consider a situation where we are not sure whether A is true or not; then we cannot express this statement. This situation is called uncertainty.
So to represent uncertain knowledge, where we are not sure about the predicates, we need
uncertain reasoning or probabilistic reasoning.
Causes of uncertainty:
Following are some leading causes of uncertainty in the real world:
o Information coming from unreliable sources.
o Experimental errors.
o Equipment faults.
o Temperature variation.
o Climate change.
Probabilistic reasoning:
Probabilistic reasoning is a way of knowledge representation where we apply the concept of
probability to indicate the uncertainty in knowledge. In probabilistic reasoning, we combine
probability theory with logic to handle the uncertainty.
In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:
o Bayes' rule
o Bayesian Statistics
Probability: Probability can be defined as the chance that an uncertain event will occur. It is a numerical measure of the likelihood that an event will occur. The value of a probability always lies between 0 and 1, where 0 and 1 represent the ideal (certain) cases: definitely false and definitely true. For any event A, 0 ≤ P(A) ≤ 1 and P(¬A) = 1 − P(A).
Sample space: The collection of all possible outcomes of an experiment is called the sample space.
Random variables: Random variables are used to represent the events and objects in the real
world.
Prior probability: The prior probability of an event is the probability computed before observing new information.
Posterior probability: The probability calculated after all evidence or information has been taken into account; it is a combination of the prior probability and the new information.
Conditional probability:
Conditional probability is the probability of an event occurring given that another event has already happened.
Suppose we want to calculate the probability of event A when event B has already occurred, "the probability of A under the conditions of B". It can be written as:
P(A|B) = P(A⋀B) / P(B)
where P(A⋀B) is the joint probability of A and B, and P(B) is the marginal probability of B.
If the probability of A is given and we need to find the probability of B, then it will be given as:
P(B|A) = P(A⋀B) / P(A)
This can be explained with a Venn diagram: once B has occurred, the sample space is reduced to the set B, so the probability of A given B is obtained by dividing P(A⋀B) by P(B).
Example:
In a class, 70% of the students like English and 40% of the students like both English and mathematics. What percentage of the students who like English also like mathematics?
Solution:
Let A be "likes mathematics" and B be "likes English". Then:
P(A|B) = P(A⋀B) / P(B) = 0.4 / 0.7 ≈ 0.57
Hence, 57% of the students who like English also like Mathematics.
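The same arithmetic as a one-line check, using the numbers from the example:

```python
# P(Maths | English) = P(English and Maths) / P(English)
p_english = 0.70            # P(likes English)
p_english_and_maths = 0.40  # P(likes English and Maths)

p_maths_given_english = p_english_and_maths / p_english
print(round(p_maths_given_english, 2))  # 0.57
```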
Bayes' theorem:
In probability theory, Bayes' theorem relates the conditional probability and the marginal probabilities of two random events. Bayes' theorem was named after the British mathematician Thomas Bayes.
The Bayesian inference is an application of Bayes' theorem, which is fundamental to Bayesian
statistics. It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).
Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
Example: If cancer corresponds to one's age then by using Bayes' theorem, we can determine
the probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using the product rule and the conditional probability of event A with known event B. From the product rule:
P(A⋀B) = P(A|B) P(B)
Similarly, with known event A:
P(A⋀B) = P(B|A) P(A)
Equating the two right-hand sides and solving for P(A|B) gives:
P(A|B) = P(B|A) P(A) / P(B)   ...(a)
The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most modern AI systems for probabilistic inference. It shows the simple relationship between joint and conditional probabilities. Here, P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of hypothesis A given that evidence B has occurred.
P(B|A) is called the likelihood, in which we consider that hypothesis is true, then we calculate
the probability of evidence.
P(A) is called the prior probability, the probability of the hypothesis before considering the evidence. P(B) is called the marginal probability, the probability of the evidence alone.
In equation (a), in general, we can write P(B) = Σ_i P(Ai) P(B|Ai); hence Bayes' rule can be written as:
P(Ai|B) = P(Ai) P(B|Ai) / Σ_k P(Ak) P(B|Ak)
where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.
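A small numeric sketch of this general form, with made-up priors and likelihoods for three hypotheses:

```python
# Bayes' rule over mutually exclusive, exhaustive hypotheses A1..A3.
# All numbers are illustrative, not from the notes.
priors = [0.5, 0.3, 0.2]          # P(Ai)
likelihoods = [0.9, 0.4, 0.1]     # P(B | Ai)

# Total probability of the evidence: P(B) = sum_i P(Ai) P(B|Ai)
p_b = sum(p * l for p, l in zip(priors, likelihoods))

# Posterior for each hypothesis: P(Ai|B) = P(Ai) P(B|Ai) / P(B)
posteriors = [p * l / p_b for p, l in zip(priors, likelihoods)]
print(p_b, posteriors)
```

The posteriors always sum to 1, since P(B) is exactly the normalizing constant.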
Example-1:
Question: What is the probability that a patient has the disease meningitis, given that the patient has a stiff neck?
Given Data:
A doctor is aware that the disease meningitis causes a patient to have a stiff neck 80% of the time. He is also aware of some more facts, which are given as follows:
o The known probability that any patient has meningitis is 1/30000.
o The known probability that any patient has a stiff neck is 2%.
Let a be the proposition that the patient has a stiff neck and b the proposition that the patient has meningitis, so we can use the following values:
P(a|b) = 0.8
P(b) = 1/30000
P(a) = 0.02
Applying Bayes' rule:
P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 ≈ 0.00133 = 1/750
Hence, we can assume that 1 patient out of 750 patients with a stiff neck has meningitis.
Example-2:
Question: From a standard deck of playing cards, a single card is drawn. The probability that the card is a king is 4/52. Calculate the posterior probability P(King|Face), i.e. the probability that the drawn card is a king given that it is a face card.
Solution:
P(King|Face) = P(Face|King) P(King) / P(Face)
P(Face|King) = 1, since every king is a face card; P(King) = 4/52; and P(Face) = 12/52, since a deck has 12 face cards (jack, queen, king in each of 4 suits).
P(King|Face) = (1 × 4/52) / (12/52) = 4/12 = 1/3
Hence, the probability that the drawn face card is a king is 1/3.
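The posterior P(King|Face) can be checked with exact fractions (the deck facts used here are standard: 12 face cards, of which 4 are kings):

```python
from fractions import Fraction

# Bayes' rule: P(King|Face) = P(Face|King) * P(King) / P(Face)
p_king = Fraction(4, 52)
p_face_given_king = Fraction(1)   # every king is a face card
p_face = Fraction(12, 52)         # 12 face cards in a 52-card deck

p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)  # 1/3
```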
Application of Bayes' theorem in Artificial intelligence:
o It is used to calculate the next step of the robot when the already executed step is given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.
Bayesian Belief Network:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it consists of two parts:
o a Directed Acyclic Graph (DAG) of nodes and arcs, and
o a table of conditional probabilities for each node.
The generalized form of Bayesian network that represents and solve decision problems under
uncertain knowledge is known as an Influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, which can be continuous or discrete.
o Each arc represents a causal relationship or conditional dependency between the variables; an arc from node A to node B means that A is a parent of B.
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)), which determines the effect of the parents on that node.
Bayesian network is based on Joint probability distribution and conditional probability. So let's
first understand the joint probability distribution:
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the joint probability distribution. By the chain rule, the joint probability P[x1, x2, x3, ..., xn] can be written in terms of conditional probabilities:
P[x1, x2, ..., xn] = P[x1 | x2, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]
In a Bayesian network, each variable is conditionally independent of its non-descendants given its parents, so this product simplifies to:
P(x1, x2, ..., xn) = Π_i P(xi | Parents(xi))
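A minimal sketch of this factorization, assuming a toy two-node network Cloudy → Rain with made-up numbers:

```python
# Joint probability of an assignment = product of each node's
# conditional probability given its parents. Numbers are illustrative.
p_cloudy = {True: 0.5, False: 0.5}      # P(Cloudy), a root node
p_rain_given = {True: 0.8, False: 0.1}  # P(Rain=True | Cloudy)

def joint(cloudy, rain):
    """P(Cloudy=cloudy, Rain=rain) = P(Cloudy) * P(Rain | Cloudy)."""
    p_rain = p_rain_given[cloudy]
    return p_cloudy[cloudy] * (p_rain if rain else 1 - p_rain)

total = sum(joint(c, r) for c in (True, False) for r in (True, False))
print(joint(True, True))   # 0.5 * 0.8 = 0.4
```

Summing the joint over every assignment gives 1, as any valid distribution must.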
Let's take the observed probability for the Burglary and earthquake component:
P(B= True) = 0.002, which is the probability of burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake
P(E= False)= 0.999, Which is the probability that an earthquake not occurred.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm A:
The Conditional probability of Alarm A depends on Burglar and earthquake:
B | E | P(A=True) | P(A=False)
Next, recall that conditional independence between two random variables, A and B, given
another random variable, C, is equivalent to satisfying the following property: P(A,B|C) =
P(A|C) * P(B|C). In other words, as long as the value of C is known and fixed, A and B are
independent. Another way of stating this, which we will use later on, is that P(A|B,C) = P(A|C).
In larger networks, this property allows us to greatly reduce the amount of required
computation, since generally, most nodes will have few parents relative to the overall size of the
network.
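This property can be verified numerically. The sketch below constructs a joint distribution in which A and B each depend only on C (all numbers are made up), then checks P(A,B|C) = P(A|C) P(B|C) directly from the joint:

```python
from itertools import product

# CPTs (illustrative): C is a root; A and B each depend only on C.
p_c = {True: 0.6, False: 0.4}
p_a = {True: 0.7, False: 0.2}   # P(A=True | C)
p_b = {True: 0.3, False: 0.9}   # P(B=True | C)

def bern(p_true, v):
    """P(X=v) given P(X=True) = p_true."""
    return p_true if v else 1 - p_true

joint = {(a, b, c): p_c[c] * bern(p_a[c], a) * bern(p_b[c], b)
         for a, b, c in product([True, False], repeat=3)}

# Recover P(A,B|C), P(A|C), P(B|C) from the joint and compare:
c = True
pc = sum(v for (a, b, cc), v in joint.items() if cc == c)
p_ab_c = joint[(True, True, c)] / pc
p_a_c = sum(v for (a, b, cc), v in joint.items() if a and cc == c) / pc
p_b_c = sum(v for (a, b, cc), v in joint.items() if b and cc == c) / pc
print(round(p_ab_c, 3), round(p_a_c * p_b_c, 3))  # 0.21 0.21
```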
Inference
Inference over a Bayesian network can come in two forms.
The first is simply evaluating the joint probability of a particular assignment of values for each
variable (or a subset) in the network. For this, we already have a factorized form of the joint
distribution, so we simply evaluate that product using the provided conditional probabilities. If
we only care about a subset of variables, we will need to marginalize out the ones we are not
interested in. In many cases, this may result in underflow, so it is common to take the logarithm
of that product, which is equivalent to adding up the individual logarithms of each term in the
product.
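A quick demonstration of why logarithms are used here: multiplying many small probabilities underflows to zero, while summing their logs stays representable.

```python
import math

# 100 factors of 1e-5: the true product is 1e-500, far below the
# smallest positive float, so repeated multiplication underflows.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)        # 0.0 (underflow)

# Summing logarithms gives an equivalent, stable representation.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)        # roughly -1151.3, still representable
```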
The second, more interesting inference task, is to find P(x|e), or, to find the probability of some
assignment of a subset of the variables (x) given assignments of other variables (our evidence,
e). In the above example, an example of this could be to find P(Sprinkler, WetGrass | Cloudy),
where {Sprinkler, WetGrass} is our x, and {Cloudy} is our e. In order to calculate this, we use
the fact that P(x|e) = P(x, e) / P(e) = αP(x, e), where α is a normalization constant that we will
calculate at the end such that P(x|e) + P(¬x | e) = 1. In order to calculate P(x, e), we must
marginalize the joint probability distribution over the variables that do not appear in x or e,
which we will denote as Y.
For the given example, we can calculate P(Sprinkler, WetGrass | Cloudy) as follows, marginalizing over the remaining variable Rain:
P(Sprinkler, WetGrass | Cloudy) = α Σ_Rain P(Cloudy) P(Sprinkler | Cloudy) P(Rain | Cloudy) P(WetGrass | Sprinkler, Rain)
We would calculate P(¬x | e) in the same fashion, just setting the value of the variables in x to
false instead of true. Once both P(x | e) and P(¬x | e) are calculated, we can solve for α, which
equals 1 / (P(x | e) + P(¬x | e)).
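The enumeration above can be sketched as follows; the CPT values are assumed, textbook-style numbers for the Cloudy/Sprinkler/Rain/WetGrass network, not taken from these notes:

```python
from itertools import product as cartesian

# Assumed CPTs: Cloudy -> Sprinkler, Cloudy -> Rain,
# (Sprinkler, Rain) -> WetGrass. Values are illustrative.
P_C = {True: 0.5, False: 0.5}
P_S = {True: 0.1, False: 0.5}    # P(Sprinkler=True | Cloudy)
P_R = {True: 0.8, False: 0.2}    # P(Rain=True | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}  # P(WetGrass=True | S, R)

def bern(p_true, v):
    return p_true if v else 1 - p_true

def joint(c, s, r, w):
    """Factorized joint: product of each node's CPT entry."""
    return (bern(P_C[True], c) * bern(P_S[c], s) *
            bern(P_R[c], r) * bern(P_W[(s, r)], w))

def query(s, w, c=True):
    """P(Sprinkler=s, WetGrass=w | Cloudy=c), marginalizing Rain."""
    num = sum(joint(c, s, r, w) for r in (True, False))
    den = sum(joint(c, ss, r, ww)   # normalization: alpha = 1 / P(e)
              for ss, r, ww in cartesian((True, False), repeat=3))
    return num / den

print(query(True, True))   # P(Sprinkler, WetGrass | Cloudy)
```

Summing `query` over all four (s, w) combinations gives 1, confirming the normalization constant α is correct.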
Note that in larger networks, Y will most likely be quite large, since most inference tasks will
only directly use a small subset of the variables. In cases like these, exact inference as shown
above is very computationally intensive, so methods must be used to reduce the amount of
computation. One more efficient method of exact inference is through variable elimination,
which takes advantage of the fact that each factor only involves a small number of variables.
This means that the summations can be rearranged such that only factors involving a given
variable are used in the marginalization of that variable. Alternatively, many networks are too
large even for this method, so approximate inference methods such as MCMC are instead used;
these provide probability estimations that require significantly less computation than exact
inference methods.
Hidden Markov Model
Hidden Markov Models, or HMMs, are the most common models used for dealing with temporal data. They also frequently come up in data science interviews, usually without the label HMM attached; in such a scenario it is necessary to recognise the problem as an HMM problem from the characteristics of HMMs.
In the Hidden Markov Model we are constructing an inference model based on the assumptions
of a Markov process.
The Markov assumption is that the "future is independent of the past given that we know the present": the future state depends only on the immediately previous state, not on the states before that. Models with this property are first-order HMMs.
What is Hidden?
With HMMs, the states are not directly observable (hidden); instead, each state produces an observable output with some probability. We observe the outputs over time to determine the sequence of states.
Example: if you are staying indoors, you will be dressed a certain way. Say you want to step outside; depending on the weather, your clothing will change. Over time, you observe the weather and make better judgements about what to wear as you become familiar with the area and climate. In an HMM, we observe the outputs over time and determine the state sequence based on how likely each sequence was to produce those outputs.
Consider a situation where you have no view of the outside world while you are in a building. The only way for you to know whether it is raining outside is to see someone carrying an umbrella when they come in. Here, the evidence (observed) variable is Umbrella, while the hidden variable is Rain.
HMM representation
A number of related tasks ask about the probability of one or more of the latent variables, given the model's parameters and a sequence of observations; in our scenario, this is the sequence of umbrella observations.
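The filtering task ("how likely is rain today, given the umbrella observations so far?") can be sketched with the forward algorithm; the transition and sensor probabilities below are assumed, textbook-style values, not from these notes:

```python
# HMM filtering (forward algorithm) for the umbrella world.
prior = {True: 0.5, False: 0.5}    # P(Rain on day 0)
trans = {True: 0.7, False: 0.3}    # P(Rain_t=True | Rain_{t-1})
sensor = {True: 0.9, False: 0.2}   # P(Umbrella=True | Rain)

def forward(belief, umbrella_seen):
    # Predict: push the belief through the transition model.
    pred = {r: sum(belief[r0] * (trans[r0] if r else 1 - trans[r0])
                   for r0 in (True, False))
            for r in (True, False)}
    # Update: weight by the evidence likelihood, then normalize.
    unnorm = {r: pred[r] * (sensor[r] if umbrella_seen else 1 - sensor[r])
              for r in (True, False)}
    z = sum(unnorm.values())
    return {r: v / z for r, v in unnorm.items()}

b = prior
for obs in (True, True):           # umbrella seen on days 1 and 2
    b = forward(b, obs)
print(round(b[True], 3))  # 0.883
```

Seeing the umbrella two days in a row pushes the belief in rain from 0.5 up to about 0.88.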
Markov Decision Process
A Markov Decision Process (MDP) model contains states, a transition model, actions, rewards, and a policy.
What is a State?
A State is a set of tokens that represent every state that the agent can be in.
What is a Model?
A Model (sometimes called a Transition Model) gives an action's effect in a state. In particular, T(S, a, S') defines a transition T where being in state S and taking action 'a' takes us to state S' (S and S' may be the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S'|S, a), which represents the probability of reaching state S' if action 'a' is taken in state S. Note that the Markov property states that the effects of an action taken in a state depend only on that state and not on the prior history.
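A minimal sketch of such a stochastic transition model, with hypothetical state and action names:

```python
import random

# T(S, a, S') stored as P(S' | S, a); states/actions are made up.
T = {
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
}

def step(state, action, rng=random):
    """Sample the next state from P(S' | S, a)."""
    dist = T[(state, action)]
    states, probs = zip(*dist.items())
    return rng.choices(states, weights=probs)[0]

# Each conditional distribution must sum to 1; the next state depends
# only on the current state and action (Markov property).
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in T.values())
print(step("s0", "right"))
```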
What are Actions?
An Action A is the set of all possible actions. A(s) defines the set of actions that can be taken in state s.
What is a Reward?
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the
state S. R(S,a) indicates the reward for being in a state S and taking an action ‘a’. R(S,a,S’)
indicates the reward for being in a state S, taking an action ‘a’ and ending up in a state S’.
What is a Policy?
A Policy is a solution to the Markov Decision Process: a mapping from states to actions that indicates which action to take while in a particular state.