Artificial Intelligence & Machine Learning Notes
1. To train the students to understand different types of AI agents.
2. To understand various AI search algorithms.
3. Fundamentals of knowledge representation, building of simple knowledge-based
systems and to apply knowledge representation.
4. To introduce the basic concepts and techniques of machine learning and the need for
machine learning techniques for real-world problems.
5. To provide understanding of various Machine learning algorithms and the way to
evaluate the performance of ML algorithms
Introduction: AI problems, Agents and Environments, Structure of Agents, Problem Solving
Agents Basic Search Strategies: Problem Spaces, Uninformed Search (Breadth-First, Depth-First
Search, Depth-first with Iterative Deepening), Heuristic Search (Hill Climbing, Generic Best-First,
A*), Constraint Satisfaction (Backtracking, Local Search)
Advanced Search: Constructing Search Trees, Stochastic Search, AO* Search Implementation,
Minimax Search, Alpha-Beta Pruning Basic Knowledge Representation and Reasoning:
Propositional Logic, First-Order Logic, Forward Chaining and Backward Chaining, Introduction to
Probabilistic Reasoning, Bayes Theorem
Machine-Learning : Introduction. Machine Learning Systems, Forms of Learning: Supervised
and Unsupervised Learning, reinforcement – theory of learning – feasibility of learning – Data
Preparation– training versus testing and split.
Supervised Learning:
Regression: Linear Regression, multi linear regression, Polynomial Regression, logistic
regression, Non-linear Regression, Model evaluation methods. Classification: – support vector
machines ( SVM) , Naïve Bayes classification
Unsupervised learning
Nearest neighbor models – K-means – clustering around medoids – silhouettes – hierarchical
clustering – k-d trees ,Clustering trees – learning ordered rule lists – learning unordered rule .
Reinforcement learning- Example: Getting Lost -State and Action Spaces
1. Elaine Rich, Kevin Knight, Shivasankar B. Nair, Artificial Intelligence, Third Edition,
McGraw Hill, 2009.
2. George F. Luger, Artificial Intelligence: Structures and Strategies for Complex Problem
Solving, 6th ed., Pearson Education, 2009.
3. Ethem Alpaydın, Introduction to Machine Learning, Second Edition, The MIT Press,
Cambridge, Massachusetts, London, England.
4. Tom M. Mitchell, Machine Learning, McGraw-Hill Science, ISBN: 0070428077.
5. Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory
to Algorithms, Cambridge University Press, 2014.
1. Understand the informed and uninformed problem types and apply search strategies to
solve them.
2. Apply difficult real life problems in a state space representation so as to solve those
using AI techniques like searching and game playing.
3. Apply machine learning techniques in the design of computer systems
4. To differentiate between various categories of ML algorithms
5. Design and make modifications to existing machine learning algorithms to suit an
attempt to build machines that like humans can think and act, able to learn and use
knowledge to solve problems on their own.
1) Game Playing
The Deep Blue chess program beat world champion Garry Kasparov.
2) Speech Recognition
PEGASUS spoken language interface to American Airlines' EAASY SABRE reservation
system, which allows users to obtain flight information and make reservations over the
telephone. The 1990s saw significant advances in speech recognition, so that
limited systems are now successful.
3) Computer Vision
Face recognition programs in use by banks, government, etc. The ALVINN system from
CMU autonomously drove a van from Washington, D.C. to San Diego (all but 52 of 2,849
miles), averaging 63 mph day and night, and in all weather conditions. Handwriting
recognition, electronics and manufacturing inspection, photo interpretation, baggage
inspection, reverse engineering to automatically construct a 3D geometric model.
4) Expert Systems
Application-specific systems that rely on obtaining the knowledge of human experts in
an area and programming that knowledge into a system.
a. Diagnostic Systems: MYCIN system for diagnosing bacterial infections of the
blood and suggesting treatments. Intellipath pathology diagnosis system (AMA
approved). Pathfinder medical diagnosis system, which suggests tests and
makes diagnoses. Whirlpool customer assistance center.
b. System Configuration
DEC's XCON system for custom hardware configuration. Radiotherapy treatment planning.
c. Financial Decision Making
Credit card companies, mortgage companies, banks, and the U.S.
government employ AI systems to detect fraud and expedite financial
transactions. For example, AMEX credit check.
d. Classification Systems
Put information into one of a fixed set of categories using several sources of
information. E.g., financial decision making systems. NASA developed a system for
classifying very faint areas in astronomical images into either stars or galaxies with
very high accuracy by learning from human experts' classifications.
5) Mathematical Theorem Proving
Use inference methods to prove new theorems.
6) Natural Language Understanding
AltaVista's translation of web pages. Translation of Caterpillar truck manuals into 20 languages.
7) Scheduling and Planning
Automatic scheduling for manufacturing. DARPA's DART system used in Desert Storm and
Desert Shield operations to plan logistics of people and supplies. American Airlines rerouting
contingency planner. European Space Agency planning and scheduling of spacecraft
assembly, integration and verification.
8) Artificial Neural Networks:
9) Machine Learning
Applications of AI:
AI algorithms have attracted close attention from researchers and have also been applied
successfully to solve problems in engineering. Nevertheless, for large and complex problems, AI
algorithms consume considerable computation time due to the stochastic nature of the search.
Building AI Systems:
1) Perception
Intelligent biological systems are physically embodied in the world and experience the
world through their sensors (senses). For an autonomous vehicle, input might be images
from a camera and range information from a rangefinder. For a medical diagnosis
system, perception is the set of symptoms and test results that have been obtained and
input to the system manually.
2) Reasoning
Inference, decision-making, and classification based on what is sensed and on the internal "model" of
the world. The mechanism might be a neural network, a logical deduction system, Hidden Markov Model
induction, heuristic search of a problem space, Bayes network inference, genetic algorithms, etc.
Includes areas of knowledge representation, problem solving, decision theory, planning, game
theory, machine learning, uncertainty reasoning, etc.
3) Action
Biological systems interact within their environment by actuation, speech, etc. All behavior is
centered around actions in the world. Examples include controlling the steering of a Mars rover or
autonomous vehicle, or suggesting tests and making diagnoses for a medical diagnosis system.
Includes areas of robot actuation, natural language generation, and speech synthesis.
The definitions of AI:
c) "The art of creating machines that perform functions that require
intelligence when performed by people" (Kurzweil, 1990)
"The study of how to make computers do things at which, at the
moment, people are better" (Rich and Knight, 1991)
d) "A field of study that seeks to explain and emulate intelligent behavior in
terms of computational processes" (Schalkoff, 1990)
"The branch of computer science that is concerned with the
automation of intelligent behavior" (Luger and Stubblefield, 1993)
The definitions on the top, (a) and (b), are concerned with reasoning, whereas those on the
bottom, (c) and (d), address behavior. The definitions on the left, (a) and (c), measure success
in terms of human performance, and those on the right, (b) and (d), measure success against an
ideal concept of intelligence called rationality.
Intelligent Systems:
In order to design intelligent systems, it is important to categorize them into four
categories (Luger and Stubblefield, 1993), (Russell and Norvig, 2003):
1. Systems that think like humans
2. Systems that think rationally
3. Systems that behave like humans
4. Systems that behave rationally
Cognitive Science: Think Human-Like
b. Focus is not just on behavior and I/O, but also on the reasoning process.
c. Goal is not just to produce human-like behavior but to produce a sequence of steps of
the reasoning process, similar to the steps followed by a human in solving the same task.
a. The study of mental faculties through the use of computational models; that is,
the study of computations that make it possible to perceive, reason, and act.
b. Focus is on inference mechanisms that are provably correct and guarantee an optimal solution.
c. Goal is to formalize the reasoning process as a system of logical rules and procedures.
a. The art of creating machines that perform functions requiring intelligence when
performed by people; that is, the study of how to make computers do things which, at
the moment, people do better.
b. Focus is on action, and not intelligent behavior centered around the representation of the world.
o The machine tries to fool the interrogator to believe that it is the
human, and the person also tries to convince the interrogator that it is
the human.
A human agent has eyes, ears, and other organs for sensors and hands, legs,
mouth, and other body parts for actuators.
A robotic agent might have cameras and infrared range finders for sensors
and various motors for actuators.
A software agent receives keystrokes, file contents, and network packets as
sensory inputs and acts on the environment by displaying on the screen, writing
files, and sending network packets.
We use the term percept to refer to the agent's perceptual inputs at any given instant.
Percept Sequence:
An agent's percept sequence is the complete history of everything the agent has ever perceived.
Agent function:
Mathematically speaking, we say that an agent's behavior is described by the agent
function that maps any given percept sequence to an action.
Agent program
Internally, the agent function for an artificial agent will be implemented by an agent
program. It is important to keep these two ideas distinct. The agent function is an
Fig 2.1.6: Partial tabulation of a simple agent function for the example: vacuum-cleaner world shown in the
Fig 2.1.6(i): The REFLEX-VACUUM-AGENT program is invoked for each new percept
(location, status) and returns an action each time
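The reflex vacuum agent described in the caption above can be sketched in a few lines of Python. This is a minimal illustration, not the figure's exact program: the square names "A" and "B" and the action names "Suck", "Right", and "Left" are assumptions borrowed from the usual two-square vacuum world.

```python
def reflex_vacuum_agent(percept):
    """Map a single (location, status) percept to an action.

    Assumed world: two squares named "A" and "B"; status is "Dirty" or "Clean".
    """
    location, status = percept
    if status == "Dirty":
        return "Suck"
    if location == "A":
        return "Right"
    return "Left"  # the agent is on square B and the square is clean


# The agent is invoked once per new percept and returns one action each time.
for percept in [("A", "Dirty"), ("A", "Clean"), ("B", "Dirty"), ("B", "Clean")]:
    print(percept, "->", reflex_vacuum_agent(percept))
```

The function looks only at the current percept, never at the percept sequence, which is what makes it a reflex agent.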
A rational agent is one that does the right thing. We say that the right action is the one that
will cause the agent to be most successful. That leaves us with the problem of deciding how
and when to evaluate the agent's success.
We use the term performance measure for the how: the criteria that determine how
successful an agent is.
Ex-Agent cleaning the dirty floor
Performance Measure-Amount of dirt collected
When to measure-Weekly for better results
The performance measure, the environment, and the agent's actuators and sensors come under
the heading of the task environment. We also call this the PEAS (Performance, Environment,
Actuators, Sensors) description.
5. Discrete vs. continuous:
If there are a limited number of distinct, clearly defined percepts and actions, we
say that the environment is discrete. Otherwise, it is continuous.
The job of AI is to design the agent program: a function that implements the agent mapping
from percepts to actions. We assume this program will run on some sort of
computing device, which we will call the architecture.
The architecture might be a plain computer, or it might include special-purpose hardware for
certain tasks, such as processing camera images or filtering audio input. It might also include
software that provides a degree of insulation between the raw computer and the agent
program, so that we can program at a higher level. In general, the architecture makes the
percepts from the sensors available to the program, runs the program, and feeds the
program's action choices to the effectors as they are generated.
The relationship among agents, architectures, and programs can be summed up as follows:
agent = architecture + program
Agent programs:
Intelligent agents accept percepts from an environment and generate actions. The
early versions of agent programs will have a very simple form (Figure 2.4).
Each will use some internal data structures that will be updated as new percepts arrive.
These data structures are operated on by the agent's decision-making procedures to
generate an action choice, which is then passed to the architecture to be executed.
Types of agents:
Agents can be grouped into four classes based on their degree of perceived intelligence and capability :
Simple Reflex Agents
Model-Based Reflex Agents
Goal-Based Agents
Utility-Based Agents
Simple reflex agents:
Simple reflex agents ignore the rest of the percept history and act only on the basis
of the current percept.
The agent function is based on the condition-action rule.
If the condition is true, then the action is taken; otherwise it is not. This agent function only succeeds
when the environment is fully observable.
Goal-based agents:
Utility-based agents:
A utility-based agent is an agent that acts based not only on what the goal is, but the best way to reach
The utility-based agent is useful when there are multiple possible alternatives and an agent has
to choose among them in order to perform the best action.
The term utility can be used to describe how "happy" the agent is.
Problem Solving Agents:
A problem-solving agent is a goal-based agent.
Problem-solving agents decide what to do by finding a sequence of actions that leads to desirable states.
Goal Formulation:
It organizes the steps required to formulate/ prepare one goal out of multiple goals available.
Problem Formulation:
It is the process of deciding what actions and states to consider, given a goal; it follows goal formulation.
The process of looking for the best sequence of actions to achieve a goal is called search.
A search algorithm takes a problem as input and returns a solution in the form of an action sequence.
Once a solution is found, the actions it recommends can be carried out. This is called the execution phase.
Well-Defined Problems and Solutions:
A problem can be defined formally by five components: the initial state, the actions, the
transition model, the goal test, and the path cost.
The initial state of the agent is the state where the agent starts in. In this case, the initial state can be
described as In: Arad
The possible actions available to the agent, corresponding to each of the state the agent
For example, ACTIONS(In: Arad) = {Go: Sibiu, Go: Timisoara, Go: Zerind}.
Actions are also known as operations.
A description of what each action does. The formal name for this is the transition model, specified by
the function Result(s, a) that returns the state that results from doing action a in state s.
We also use the term successor to refer to any state reachable from a given state by a single action.
For example: Result(In(Arad), Go(Zerind)) = In(Zerind)
Together, the initial state, actions, and transition model implicitly define the state space of the problem.
State space: the set of all states reachable from the initial state by any sequence of actions.
The goal test, determining whether the current state is a goal state. Here, the goal state is
The path cost function, which determines the cost of each path and reflects the
performance measure.
We define the step cost function as c(s, a, s’), where s is the current state and a is the action performed by
the agent to reach state s’.
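The components listed above can be collected into one small Python class. This is a sketch under stated assumptions: the class name RouteProblem, the goal city Bucharest, and the road distances are illustrative (the distances are the ones commonly quoted for the textbook Romania map and should be checked against the figure); only the three actions available in Arad come from the text above.

```python
class RouteProblem:
    """A route-finding problem: initial state, actions, transition model,
    goal test, and step cost. States are city names."""

    def __init__(self, initial, goal, roads):
        self.initial = initial
        self.goal = goal
        self.roads = roads  # roads[city] = {neighbouring city: distance}

    def actions(self, state):
        # One "Go" action per road leaving the current city.
        return [("Go", city) for city in sorted(self.roads.get(state, {}))]

    def result(self, state, action):
        # Transition model Result(s, a): driving to a city puts the agent there.
        _, city = action
        if city not in self.roads.get(state, {}):
            raise ValueError(f"no road from {state} to {city}")
        return city

    def goal_test(self, state):
        return state == self.goal

    def step_cost(self, state, action, next_state):
        # c(s, a, s'): the length of the road that was driven.
        return self.roads[state][next_state]


# Assumed fragment of the map: only the roads out of Arad.
roads = {"Arad": {"Sibiu": 140, "Timisoara": 118, "Zerind": 75}}
problem = RouteProblem("Arad", "Bucharest", roads)
print(problem.actions("Arad"))
print(problem.result("Arad", ("Go", "Zerind")))
```

Keeping the five components behind one interface lets the same search code run on any problem that supplies them.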
Example –
8 puzzle problem
Initial State
Goal State
States: a state description specifies the location of each of the eight tiles in one of the
nine squares. For efficiency, it is useful to include the location of the blank.
Actions: blank moves left, right, up, or down.
Transition Model: Given a state and action, this returns the resulting state. For example, if
we apply left to the start state, the resulting state has the 5 and the blank switched.
Goal test: state matches the goal configuration shown in fig.
Path cost: each step costs 1, so the path cost is just the length of the path.
State Space Search/Problem Space Search:
The state space representation forms the basis of most of the AI methods.
Formulate a problem as a state space search by showing the legal problem states, the
legal operators, and the initial and goal states.
A state is defined by the specification of the values of all attributes of interest in the world
An operator changes one state into another; it has a precondition, which is the value of
certain attributes prior to the application of the operator, and a set of effects, which
are the attributes altered by the operator
The initial state is where you start
The goal state is the partial description of the solution
Example: 8-queens problem
Search strategies:
Search: Searching is a step-by-step procedure to solve a search problem in a given search space. A
search problem has three main components:
Search Space: The search space represents the set of possible solutions that a system may have.
Start State: It is the state from which the agent begins the search.
Goal test: It is a function that observes the current state and returns whether the goal state has been
achieved or not.
Properties of Search Algorithms
Which search algorithm one should use will generally depend on the
problem domain. There are four important factors to consider:
2. Optimality – Is the solution found guaranteed to be the best (or lowest cost) solution if
3. Time Complexity – The upper bound on the time required to find a solution, as a function
Many traditional search algorithms are used in AI applications. For complex problems, the traditional
algorithms are unable to find the solution within some practical time and space limits. Consequently,
many special techniques are developed; using heuristic functions. The algorithms that use heuristic
functions are called heuristic algorithms. Heuristic algorithms are not really intelligent; they appear to
be intelligent because they achieve better performance.
Heuristic algorithms are more efficient because they take advantage of feedback from the data to
direct the search path.
Uninformed search
Also called blind, exhaustive or brute-force search; it uses no information about the problem to guide
the search and therefore may not be very efficient.
Informed Search:
Also called heuristic or intelligent search; it uses information about the problem to guide the search, usually
an estimate of the distance to a goal state. It is therefore more efficient, but a useful heuristic may not always be available.
One simple search strategy is breadth-first search. In this strategy, the root node is
expanded first, then all the nodes generated by the root node are expanded next, and
then their successors, and so on.
In general, all the nodes at depth d in the search tree are expanded before the nodes at depth d + 1.
BFS illustrated:
Step 1: Initially frontier contains only one node corresponding to the source state A.
Figure 1
Frontier: A
Step 2: A is removed from fringe. The node is expanded, and its children B and C are generated.
They are placed at the back of fringe.
Figure 2
Frontier: B C
Step 3: Node B is removed from fringe and is expanded. Its children D, E are generated and
Step 4: Node C is removed from fringe and is expanded. Its children D and G are added to
the back of fringe.
Figure 4
Frontier: D E D G
Step 5: Node D is removed from fringe. Its children C and F are generated and added to the
back of fringe.
Figure 5
Frontier: E D G C F
Figure 6
Frontier: D G C F
Figure 7
Frontier: G C F B F
Step 8: G is selected for expansion. It is found to be a goal node. So the algorithm returns
the path A C G by following the parent pointers of the node corresponding to G. The
algorithm terminates.
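The walkthrough above can be reproduced with a short breadth-first search in Python. The graph below is an assumption reconstructed from the steps (A has children B and C, B has D and E, C has D and G, D has C and F). Unlike the illustration, this sketch skips states that were already generated, so D is queued only once.

```python
from collections import deque

# Assumed successor lists, reconstructed from the BFS walkthrough.
GRAPH = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["D", "G"],
    "D": ["C", "F"],
    "E": [],
    "F": [],
    "G": [],
}


def breadth_first_search(graph, start, goal):
    """Return the shallowest path from start to goal, or None if there is none."""
    parent = {start: None}     # parent pointers; also marks generated states
    frontier = deque([start])  # FIFO queue: new nodes go to the back
    while frontier:
        node = frontier.popleft()
        if node == goal:
            # Follow the parent pointers back to the start state.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for child in graph[node]:
            if child not in parent:
                parent[child] = node
                frontier.append(child)
    return None


print(breadth_first_search(GRAPH, "A", "G"))  # ['A', 'C', 'G']
```

As in Step 8, the goal test is applied when a node is selected for expansion, and the path is recovered from the parent pointers.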
BFS will provide a solution if any solution exists.
If there is more than one solution for a given problem, then BFS will provide the minimal
solution, which requires the least number of steps.
Requires the generation and storage of a tree whose size is exponential in the depth of
the shallowest goal node.
The breadth-first search algorithm cannot be effectively used unless the search space
is quite small.
Applications Of Breadth-First Search Algorithm
GPS Navigation systems: Breadth-First Search is one of the best algorithms used to find neighboring
locations by using the GPS system.
Broadcasting: Networking makes use of what we call as packets for communication. These packets
follow a traversal method to reach various networking nodes. One of the most commonly used
DFS illustrated:
Figure 1
Step 2: A is removed from fringe. A is expanded and its children B and C are put in front of
Figure 2
Step 3: Node B is removed from fringe, and its children D and E are pushed in front of fringe.
Figure 3
Step 4: Node D is removed from fringe. C and F are pushed in front of fringe.
Figure 4
Step 5: Node C is removed from fringe. Its child G is pushed in front of fringe.
Figure 5
Step 6: Node G is expanded and found to be a goal node.
Figure 6
Note that the time taken by the algorithm is related to the maximum depth of the search tree. If the
search tree has infinite depth, the algorithm may not terminate. This can happen if the search space is
infinite. It can also happen if the search space contains cycles. The latter case can be handled by
checking for cycles in the algorithm. Thus Depth First Search is not complete.
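A minimal depth-first search with the cycle check mentioned above might look as follows. The graph is the same assumed reconstruction of the illustrated example (it is not given explicitly in the notes), and the check simply refuses to revisit a state that already lies on the current path.

```python
# Assumed successor lists, reconstructed from the illustrated example.
GRAPH = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["D", "G"],
    "D": ["C", "F"],
    "E": [],
    "F": [],
    "G": [],
}


def depth_first_search(graph, start, goal):
    """Return a path from start to goal found depth-first, or None."""
    stack = [[start]]  # LIFO stack of partial paths; new paths go on top
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == goal:
            return path
        # Push children in reverse so the left-most child is expanded first.
        for child in reversed(graph[node]):
            if child not in path:  # cycle check along the current path
                stack.append(path + [child])
    return None


print(depth_first_search(GRAPH, "A", "G"))  # ['A', 'B', 'D', 'C', 'G']
```

Without the `child not in path` test, the C to D to C loop in this graph could be followed forever, which is the non-termination case described above. The check does not help when the state space itself is infinite.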
It combines the benefits of BFS and DFS search algorithm in terms of fast search and
The main drawback of IDDFS is that it repeats all the work of the previous phase.
Iterative deepening search L=1
Complete: Yes
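Time: O(b^d)
Space: O(bd)
(b is the branching factor and d is the depth of the shallowest goal node.)
We can conclude that IDS is a hybrid search strategy between BFS and DFS, inheriting
their advantages.
IDS matches the O(b^d) time of BFS while needing only the linear memory of DFS.
It is said that "IDS is the preferred uninformed search method when there is a large search space and the
depth of the solution is not known."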
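Iterative deepening can be sketched as a depth-limited DFS that is restarted with limits 0, 1, 2, and so on. The graph and the `max_depth` cap below are assumptions made for illustration only.

```python
# Assumed successor lists (the same small example graph as in the BFS/DFS steps).
GRAPH = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["D", "G"],
    "D": ["C", "F"],
    "E": [],
    "F": [],
    "G": [],
}


def depth_limited(graph, node, goal, limit, path):
    """DFS that gives up below `limit` further steps; returns a path or None."""
    if node == goal:
        return path
    if limit == 0:
        return None
    for child in graph[node]:
        if child not in path:  # avoid looping along the current path
            found = depth_limited(graph, child, goal, limit - 1, path + [child])
            if found is not None:
                return found
    return None


def iterative_deepening(graph, start, goal, max_depth=20):
    """Try depth limits 0, 1, 2, ...; return (path, limit) or (None, None)."""
    for limit in range(max_depth + 1):
        found = depth_limited(graph, start, goal, limit, [start])
        if found is not None:
            return found, limit
    return None, None


print(iterative_deepening(GRAPH, "A", "G"))  # (['A', 'C', 'G'], 2)
```

Each new limit repeats the work of the previous one, which is the drawback noted earlier, but only the current path is kept in memory.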
might not always find the best solution but is guaranteed to find a good solution
in reasonable time. By sacrificing completeness it increases efficiency.
Useful in solving tough problems which
o could not be solved any other way.
o have solutions that take an infinite or very long time to compute.
Source state
1 3 2
6 5 4
_ 8 7
Destination state
1 2 3
4 5 6
7 8 _
(_ marks the blank square.)
Then the Manhattan distance is the sum of the number of moves required to move
each tile from its position in the source state to its position in the destination state.
Tile:                   1 2 3 4 5 6 7 8
No. of moves to reach:  0 1 1 2 0 2 2 0
Manhattan distance = 0 + 1 + 1 + 2 + 0 + 2 + 2 + 0 = 8
Source state
1 3 2
6 5 4
_ 8 7
Destination state
1 2 3
4 5 6
7 8 _
Here, just count the number of tiles that are out of place relative to the goal state.
Tiles 1, 5, and 8 are already in place.
Tiles 2, 3, 4, 6, and 7 are not, so the heuristic value is 5 (because 5 tiles have to be moved).
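Both heuristics can be computed mechanically. The sketch below assumes the blank (written 0) sits in the bottom-left corner of the source state and the bottom-right corner of the destination state, which is the only placement consistent with tile 8 already being in place; the function names are illustrative.

```python
# 0 stands for the blank square (its position in SOURCE is an assumption).
SOURCE = ((1, 3, 2),
          (6, 5, 4),
          (0, 8, 7))
GOAL = ((1, 2, 3),
        (4, 5, 6),
        (7, 8, 0))


def positions(state):
    """Map each tile to its (row, column) position."""
    return {tile: (r, c)
            for r, row in enumerate(state)
            for c, tile in enumerate(row)}


def manhattan_per_tile(state, goal):
    """Horizontal plus vertical distance of every tile from its goal square."""
    here, there = positions(state), positions(goal)
    return {tile: abs(here[tile][0] - there[tile][0])
                  + abs(here[tile][1] - there[tile][1])
            for tile in sorted(here) if tile != 0}


def manhattan(state, goal):
    return sum(manhattan_per_tile(state, goal).values())


def misplaced_tiles(state, goal):
    """Number of tiles (blank excluded) that are not on their goal square."""
    return sum(1 for moves in manhattan_per_tile(state, goal).values() if moves > 0)


print(manhattan_per_tile(SOURCE, GOAL))
print("Manhattan distance:", manhattan(SOURCE, GOAL))
print("Misplaced tiles:", misplaced_tiles(SOURCE, GOAL))  # 5
```

The blank is left out of both sums, so neither value can exceed the number of moves a real solution needs.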
The hill climbing algorithm is a local search algorithm which continuously moves in the
direction of increasing elevation/value to find the peak of the mountain or the best solution
to the problem. It terminates when it reaches a peak value where no neighbor has a
higher value.
It is also called greedy local search, as it only looks at its good immediate neighbor
states and not beyond them.
Hill climbing is mostly used when a good heuristic is available.
In this algorithm, we don't need to maintain and handle the search tree or graph, as it
only keeps a single current state.
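A minimal steepest-ascent version of hill climbing is sketched below. The objective function and the neighbor function are made-up examples (a single peak at x = 3 on the integers); only the loop structure reflects the description above.

```python
def hill_climb(value, neighbors, start):
    """Keep moving to the best neighbor until no neighbor is strictly better."""
    current = start  # the only state the algorithm remembers
    while True:
        best = max(neighbors(current), key=value, default=current)
        if value(best) <= value(current):
            return current  # a peak: no neighbor has a higher value
        current = best


def value(x):
    # Hypothetical objective with a single peak at x = 3.
    return -(x - 3) ** 2


def neighbors(x):
    # Hypothetical neighborhood: one step left or right.
    return [x - 1, x + 1]


print(hill_climb(value, neighbors, 0))  # 3
```

On an objective with several peaks the same loop stops at whichever peak is nearest to the start state, which may be only a local maximum.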
Local Maximum: Local maximum is a state which is better than its neighbor states, but there is also another state which is
higher than it.
Global Maximum: Global maximum is the best possible state of state space landscape. It has the highest value of objective
Flat local maximum: It is a flat space in the landscape where all the neighbor states of current states have the same value.
Problems in Hill Climbing Algorithm:
A hill-climbing algorithm that never makes “downhill” moves towards states with lower value
(or higher cost) is guaranteed to be incomplete, because it can get stuck on a local maximum. In
contrast, a purely random walk (that is, moving to a successor chosen uniformly at random
from the set of successors) is complete, but extremely inefficient. Simulated annealing is an
algorithm that combines hill climbing with a random walk in a way that yields both
efficiency and completeness.
The simulated annealing algorithm is quite similar to hill climbing. Instead of picking the best
move, however, it picks a random move. If the move improves the situation, it is always
accepted. Otherwise, the algorithm accepts the move with some probability less than 1. The
probability decreases exponentially with the “badness” of the move, that is, the amount E by which
the evaluation is worsened. The probability also decreases as the "temperature" T goes down: "bad"
moves are more likely to be allowed at the start when the temperature is high, and they become more
unlikely as T decreases. One can prove that if the schedule lowers T slowly enough, the algorithm will
find a global optimum with probability approaching 1.
Simulated annealing was first used extensively to solve VLSI layout problems. It has been applied widely
to factory scheduling and other large-scale optimization tasks.
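The acceptance rule can be written down directly. The notes only say that the probability falls exponentially with the badness E and with decreasing temperature T; the specific form exp(-E / T) used below is the conventional choice and is an assumption here, as are the function names.

```python
import math
import random


def acceptance_probability(badness, temperature):
    """Probability of accepting a move that worsens the evaluation by `badness`.

    Assumes the conventional form exp(-badness / temperature), temperature > 0.
    """
    if badness <= 0:
        return 1.0  # improving (or neutral) moves are always accepted
    return math.exp(-badness / temperature)


def accept(badness, temperature, rng=random):
    """Randomly decide whether to take the move."""
    return rng.random() < acceptance_probability(badness, temperature)


# The same bad move becomes less likely to be accepted as T goes down.
for temperature in (10.0, 1.0, 0.1):
    print(temperature, round(acceptance_probability(1.0, temperature), 4))
```

A full annealer would call `accept` inside a loop that picks a random successor and lowers the temperature according to some schedule.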
OPEN is a priority queue of nodes that have been evaluated by the heuristic function but
which have not yet been expanded into successors. The most promising nodes are at the front.
CLOSED holds nodes that have already been generated; these nodes must be stored because
a graph is being used in preference to a tree.
• If it has been generated before, change the parent if this new path is
better, and in that case update the cost of getting to any successor
1. It is not optimal.
2. It is incomplete because it can start down an infinite path and never return to try
3. The worst-case time complexity for greedy search is O(b^m), where b is the branching factor
and m is the maximum depth of the search space.
4. Because greedy search retains all nodes in memory, its space complexity is the same
as its time complexity.
A* Algorithm
The A* search algorithm (pronounced "Ay-star") is a tree search algorithm that finds a path
from a given initial node to a given goal node (or one passing a given goal test). It employs a
"heuristic estimate" which ranks each node by an estimate of the best route that goes through
that node. It visits the nodes in order of this heuristic estimate.
Similar to greedy best-first search but is more accurate because A* takes into account the
nodes that have already been traversed.
g is a measure of the distance/cost to go from the initial node to the current node
Thus f is an estimate of how long it takes to go from the initial node to the solution
save n in CLOSED
a) If m ∉ [OPEN ∪ CLOSED]
Set g(m) = g(n) + c(n, m)
Set f(m) = g(m) + h(m)
Insert m in OPEN
b) If m ∈ [OPEN ∪ CLOSED]
Move m to OPEN.
A* begins at a selected node. Applied to this node is the "cost" of entering this node
(usually zero for the initial node). A* then estimates the distance to the goal node
from the current node. This estimate and the cost added together are the heuristic
which is assigned to the path leading to this node. The node is then added to a priority
queue, often called "open".
The algorithm then removes the next node from the priority queue (because of the way
a priority queue works, the node removed will have the lowest heuristic). If the queue is
empty, there is no path from the initial node to the goal node and the algorithm stops. If
the node is the goal node, A* constructs and outputs the successful path and stops.
If the node is not the goal node, new nodes are created for all admissible adjoining
nodes; the exact way of doing this depends on the problem at hand. For each successive
node, A* calculates the "cost" of entering the node and saves it with the node. This cost
is calculated from the cumulative sum of costs stored with its ancestors, plus the cost of
the operation which reached this new node.
The algorithm also maintains a 'closed' list of nodes whose adjoining nodes have been
checked. If a newly generated node is already in this list with an equal or lower cost, no
further processing is done on that node or with the path associated with it. If a node in
the closed list matches the new one, but has been stored with a higher cost, it is
removed from the closed list, and processing continues on the new node.
Next, an estimate of the new node's distance to the goal is added to the cost to form the
heuristic for that node. This is then added to the 'open' priority queue, unless an identical node
is found there.
Once the above three steps have been repeated for each new adjoining node, the original node
taken from the priority queue is added to the 'closed' list. The next node is then popped from
the priority queue and the process is repeated
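The procedure described above can be sketched with Python's `heapq` as the open priority queue. The road lengths and straight-line estimates below are the values usually quoted for the textbook Romania map; they are not given in these notes and are included as assumptions for illustration. Instead of removing entries from a closed list, this sketch keeps the best known cost per node and skips stale queue entries, which has the same effect.

```python
import heapq

# Assumed fragment of the Romania road map (distances in km).
EDGES = [
    ("Arad", "Zerind", 75), ("Arad", "Sibiu", 140), ("Arad", "Timisoara", 118),
    ("Sibiu", "Fagaras", 99), ("Sibiu", "Rimnicu Vilcea", 80),
    ("Rimnicu Vilcea", "Pitesti", 97), ("Fagaras", "Bucharest", 211),
    ("Pitesti", "Bucharest", 101),
]
# Assumed straight-line distance from each city to Bucharest: h(n).
H = {"Arad": 366, "Zerind": 374, "Timisoara": 329, "Sibiu": 253, "Fagaras": 176,
     "Rimnicu Vilcea": 193, "Pitesti": 100, "Bucharest": 0}

ROADS = {}
for a, b, km in EDGES:  # roads can be driven in both directions
    ROADS.setdefault(a, {})[b] = km
    ROADS.setdefault(b, {})[a] = km


def a_star(roads, h, start, goal):
    """Return (path, cost) with the lowest cost, or (None, inf) if unreachable."""
    open_heap = [(h[start], 0, start, [start])]  # entries: (f, g, node, path)
    best_g = {start: 0}                          # cheapest known cost per node
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)  # lowest f = g + h first
        if node == goal:
            return path, g
        if g > best_g[node]:
            continue  # stale entry: a cheaper path to this node was found later
        for nxt, cost in roads[node].items():
            g2 = g + cost  # cumulative cost through the current node
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(open_heap, (g2 + h[nxt], g2, nxt, path + [nxt]))
    return None, float("inf")


print(a_star(ROADS, H, "Arad", "Bucharest"))
```

With these numbers the route through Fagaras reaches Bucharest first in the queue order at cost 450, but it is not returned, because the cheaper route through Rimnicu Vilcea and Pitesti (cost 418) is popped first.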
The heuristic costs from each city to Bucharest:
A* search properties:
The algorithm A* is admissible. This means that provided a solution exists, the first
solution found by A* is an optimal solution. A* is admissible under the following
A* is also complete.
Sometimes a problem is not embedded in a long set of action sequences but requires picking the
best option from available choices. A good general-purpose problem solving technique is to list
the constraints of a situation (either negative constraints, like limitations, or positive elements
that you want in the final solution). Then pick the choice that satisfies most of the constraints.
Formally speaking, a constraint satisfaction problem (or CSP) is defined by a set of variables,
X1, X2, ..., Xn, and a set of constraints, C1, C2, ..., Cm. Each variable Xi has a nonempty domain Di of
possible values. Each constraint Ci involves some subset of the variables and specifies the allowable
combinations of values for that subset. A state of the problem is defined by an assignment of
values to some or all of the variables, {Xi = vi, Xj = vj, ...}. An assignment that does not violate any
constraints is called a consistent or legal assignment. A complete assignment is one in which every
variable is mentioned, and a solution to a CSP is a complete assignment that satisfies all the
constraints. Some CSPs also require a solution that maximizes an objective function.
1. Initial state: the empty assignment {}, in which all variables are unassigned.
2. Successor function: a value can be assigned to any unassigned variable, provided that it
does not conflict with previously assigned variables.
3. Goal test: the current assignment is complete.
4. Path cost: a constant cost for every step.
We are given the task of coloring each region red, green, or blue in such a way that
no neighboring regions have the same color.
To formulate this as a CSP, we define the variables to be the regions: WA, NT, Q, NSW, V, SA, and
T. The domain of each variable is the set {red, green, blue}. The constraints require
neighboring regions to have distinct colors: for example, the allowable combinations
for WA and NT are the pairs
{(red,green),(red,blue),(green,red),(green,blue),(blue,red),(blue,green)}. (The constraint
can also be represented as the inequality WA ≠ NT.) There are many possible solutions,
such as {WA = red, NT = green, Q = red, NSW = green, V = red, SA = blue, T = red}.
Figure: Map of Australia showing each of its states and territories
Constraint Graph: A CSP is usually represented as an undirected graph, called a
constraint graph, where the nodes are the variables and the edges are the constraints.
A CSP can be formulated as a standard search problem as follows:
> Initial state: the empty assignment {}, in which all variables are unassigned.
> Successor function: a value can be assigned to any unassigned variable, provided
that it does not conflict with previously assigned variables.
> Goal test: the current assignment is complete.
> Path cost: a constant cost (e.g., 1) for every step.
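The search formulation above leads directly to a simple backtracking solver. The sketch below applies it to the map-coloring example; the neighbor lists are an assumption taken from the geography of Australia (they are not spelled out in the notes), and the variable order and color order are arbitrary choices.

```python
# Assumed constraint graph: which regions share a border.
NEIGHBORS = {
    "WA": ["NT", "SA"],
    "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"],
    "V": ["SA", "NSW"],
    "T": [],
}
COLORS = ["red", "green", "blue"]


def consistent(var, value, assignment):
    """A value is allowed if no already-assigned neighbor has the same color."""
    return all(assignment.get(n) != value for n in NEIGHBORS[var])


def backtrack(assignment):
    """Extend a partial assignment; return a complete one or None."""
    if len(assignment) == len(NEIGHBORS):  # goal test: assignment is complete
        return assignment
    var = next(v for v in NEIGHBORS if v not in assignment)
    for value in COLORS:
        if consistent(var, value, assignment):  # successor function
            result = backtrack({**assignment, var: value})
            if result is not None:
                return result
    return None  # no value works for var: undo and try another branch


solution = backtrack({})  # initial state: the empty assignment {}
print(solution)
```

With this ordering the solver happens to reach the same coloring quoted in the text, with WA, Q, V, and T red, NT and NSW green, and SA blue.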
Game Playing
Adversarial search, or game-tree search, is a technique for analyzing an adversarial game in order to try
to determine who can win the game and what moves the players should make in order to win.
Adversarial search is one of the oldest topics in Artificial Intelligence. The original ideas for adversarial
search were developed by Shannon in 1950 and independently by Turing in 1951, in the context of the
game of chess—and their ideas still form the basis for the techniques used today.
2- Person Games:
Properties of minimax:
– Not always feasible to traverse entire tree
– Time limitations
1) Setup phase: Assign to each left-most (or right-most) internal node of the tree,
variables: alpha = -infinity, beta = +infinity
2) Look at first computed final configuration value. It’s a 3. Parent is a min node, so
set the beta (min) value to 3.
3) Look at next value, 5. Since parent is a min node, we want the minimum of 3
and 5 which is 3. Parent min node is done – fill alpha (max) value of its parent max node.
Always set alpha for max nodes and beta for min nodes. Copy the state of the max parent node
into the second unevaluated min child.
4) Look at next value, 2. Since the parent node is a min node with beta = +inf and 2 is smaller,
change beta to 2.
5) Now, the min parent node has alpha = 3 and beta = 2, so alpha >= beta. The value of the 2nd
child does not matter. If it is >2, 2 will be selected for the min node. If it is <2, it will be
selected for the min node, but since it is <3 it will not get selected for the parent max node.
Thus, we prune the right subtree of the min node. Propagate the max value up the tree.
6) Max node is now done and we can set the beta value of its parent and propagate node
state to sibling subtree’s left-most path.
7) The next node is 10. 10 is not smaller than 3, so state of parent does not change. We still
have to look at the 2nd child since alpha is still –inf.
8) The next node is 4. The smallest value goes to the parent min node. The min subtree is done, so
the parent max node gets the alpha (max) value from the child. Note that if the max node
had a 2nd subtree, we could prune it since alpha > beta.
9) Continue propagating value up the tree, modifying the corresponding alpha/beta values.
Also propagate the state of root node down the left-most path of the right subtree.
10) Next value is a 2. We set the beta (min) value of the min parent to 2. Since no other
children exist, we propagate the value up the tree.
11) We have a value for the 3rd level max node, so now we can modify the beta (min) value of
the min parent to 2. Now we have a situation where alpha > beta, and thus the value of the
rightmost subtree of the min node does not matter, so we prune the whole subtree.
12) Finally, no more nodes remain, we propagate values up the tree. The root has a value of
3 that comes from the left-most child. Thus, the player should choose the left-most
child’s move in order to maximize his/her winnings. As you can see, the result is the same
as with the mini-max example, but we did not visit all nodes of the tree.
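The pruning rule used in the steps above (stop exploring a node as soon as alpha >= beta) can be sketched as follows. The tree below is a small hypothetical example chosen for illustration, not the tree from the walkthrough; leaves are numbers and internal nodes are lists.

```python
import math


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value of node with alpha-beta pruning; records visited leaves."""
    if visited is None:
        visited = []
    if not isinstance(node, list):  # leaf: a final configuration value
        visited.append(node)
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)  # alpha is updated at max nodes
            if alpha >= beta:
                break  # remaining children cannot change the result
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)  # beta is updated at min nodes
        if alpha >= beta:
            break
    return value


# Hypothetical tree: a max root with three min children.
TREE = [[3, 5, 6], [2, 9, 1], [10, 4, 2]]
visited_leaves = []
root_value = alphabeta(TREE, True, visited=visited_leaves)
print(root_value, visited_leaves)
```

In this sketch the second min node is abandoned after its first leaf (2), because the root already has alpha = 3 from the first subtree, so leaves 9 and 1 are never evaluated.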
AO* Search: (And-Or) Graph:
The depth-first search and breadth-first search given earlier for OR trees or graphs can be easily
adapted to AND-OR graphs. The main difference lies in the way termination conditions are
determined, since all goals following an AND node must be realized, whereas a single goal node
following an OR node will do. For this purpose we use the AO* algorithm.
Like the A* algorithm, here we will use two arrays and one heuristic function.
OPEN: It contains the nodes that have been traversed but not yet been marked solvable or
unsolvable.
CLOSE: It contains the nodes that have already been processed.
h(n): The distance from the current node to the goal node.
Step 1: Place the starting node into OPEN.
Step 2: Compute the most promising solution tree say T0.
Step 3: Select a node n that is both on OPEN and a member of T0. Remove it from OPEN and place it
in CLOSE.
Step 4: If n is a terminal goal node, then label n as solved and label all the ancestors of n
as solved. If the starting node is marked as solved, then report success and exit.
Step 5: If n is not a solvable node, then mark n as unsolvable. If the starting node is marked as
unsolvable, then return failure and exit.
Step 6: Expand n. Find all its successors and find their h (n) value, push them into OPEN.
Step 7: Return to Step 2.
Step 8: Exit.
Let us take the following example to implement the AO* algorithm.
Step 1:
In the above graph, the solvable nodes are A, B, C, D, E, F and the unsolvable nodes are G, H. Take A as
the starting node. So place A into OPEN.
It is an optimal algorithm.
If traverse according to the ordering of nodes. It can be used for both OR and AND graph.
Sometimes for unsolvable nodes, it can’t find the optimal path. Its complexity is than other
• But how machines do all these things comes under knowledge representation
• There are three factors which are put into the machine, which makes it valuable:
• Knowledge: The information related to the environment is stored in the machine.
• Reasoning: The ability of the machine to understand the stored knowledge.
• Intelligence: The ability of the machine to make decisions on the basis of the stored
• A knowledge representation language is defined by two aspects:
• The syntax of a language describes the possible configurations that can constitute sentences.
• The semantics determines the facts in the world to which the sentences refer.
• For example, the syntax of the language of arithmetic expressions says that if x
and y are expressions denoting numbers, then x > y is a sentence about numbers. The
semantics of the language says that x > y is false when y is a bigger number than x, and
true otherwise. From the syntax and semantics, we can derive an inference mechanism for
an agent that uses the language.
• Recall that the semantics of the language determine the fact to which a given
sentence refers. Facts are part of the world, whereas their representations must be
encoded in some way that can be physically stored within an agent. We cannot put the
world inside a computer (nor can we put it inside a human), so all reasoning mechanisms
must operate on representations of facts, rather than on the facts themselves.
• Because sentences are physical configurations of parts of the agent, reasoning must be
a process of constructing new physical configurations from old ones. Proper reasoning
should ensure that the new configurations represent facts that actually follow from the
facts that the old configurations represent.
• We want to generate new sentences that are necessarily true, given that the old sentences are
true. This relation between sentences is called entailment.
• Propositional logic (PL) is the simplest form of logic, where all the statements
are made by propositions.
• The symbols of propositional logic are the logical constants True and False,
proposition symbols such as P and Q, and the logical connectives ∧, ∨, ⇔, ⇒, and ¬.
• All sentences are made by putting these symbols together using the following rules:
Example: P = Rohan is intelligent, Q = Rohan is hardworking → P ∧ Q.
A biconditional sentence is written as P ⇔ Q.
Precedence of connectives:
Precedence            Operators
First Precedence      Parenthesis
Second Precedence     Negation
Third Precedence      Conjunction (AND)
Fourth Precedence     Disjunction (OR)
• Truth tables can be used not only to define the connectives, but also to test for valid
sentences.
• Given a sentence, we make a truth table with one row for each of the possible
combinations of truth values for the proposition symbols in the sentence.
• If the sentence is true in every row, then the sentence is valid. For example, the
sentence ((P ∨ H) ∧ ¬H) ⇒ P is valid.
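The truth-table test for validity described above can be sketched in a few lines. The helper names implies and is_valid are hypothetical; a sentence is modeled as a Python function of its proposition symbols.

```python
from itertools import product


def implies(a, b):
    """Material implication: a => b is false only when a is true and b is false."""
    return (not a) or b


def is_valid(sentence, num_symbols):
    """A sentence is valid if it is true in every row of its truth table."""
    rows = product([True, False], repeat=num_symbols)
    return all(sentence(*values) for values in rows)


# ((P or H) and not H) => P : true in all four rows
valid = is_valid(lambda p, h: implies((p or h) and not h, p), 2)

# (P or H) => P : fails in the row P = False, H = True
not_valid = is_valid(lambda p, h: implies(p or h, p), 2)

print(valid, not_valid)
```

The second sentence is included only as a contrast: one falsifying row is enough to make a sentence not valid.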
• P: It is hot
• Q: It is humid
• R: It is raining
1. If it is humid, then it is hot: Q → P
2. If it is hot and humid, then it is raining: (P ∧ Q) → R
• In propositional logic, we can only represent the facts, which are either true or false.
First-order logic:
Propositional logic is a declarative language: knowledge and inference are
separate, and inference is entirely domain-independent. Its semantics is based on a
truth relation between sentences and possible worlds.
It also has sufficient expressive power to deal with partial information, using
disjunction and negation.
However, it cannot describe an environment with many objects concisely. For example, we
were forced to write a separate rule about breezes and pits for each square, such as
B1,1 ⇔ (P1,2 ∨ P2,1).
In English, it seems easy enough to say, “Squares adjacent to pits are breezy.”
The syntax and semantics of English somehow make it possible to describe the environment
concisely.
The models of a logical language are the formal structures that constitute the possible worlds
under consideration. Each model links the vocabulary of the logical sentences to elements of
the possible world, so that the truth of any sentence can be determined. Thus, models for
propositional logic link proposition symbols to predefined truth values. Models for first-order
logic have objects. The domain of a model is the set of objects or domain elements it
contains. The domain is required to be nonempty—every possible world must contain at least
one object.
A relation is just the set of tuples of objects that are related.
Unary relation: a relation on a single object (a property). Binary relation: a relation
between two objects. Certain kinds of relationships are best considered as functions, in
that a given object must be related to exactly one object.
For example:
Objects: Richard the Lionheart, King of England from 1189 to 1199; his younger brother, the
evil King John, who ruled from 1199 to 1215; the left legs of Richard and John; a crown.
Unary relation: John is a king. Binary relations: the crown is on the head of John; Richard
is the brother of John.
The unary "left leg" function includes the following mappings: (Richard the Lionheart) →
Richard's left leg; (King John) → John's left leg.
Symbols are the basic syntactic elements of first-order logic. Symbols stand for objects,
relations, and functions.
The symbols are of three kinds:
Constant symbols, which stand for objects. Example: John, Richard
Predicate symbols, which stand for relations. Example: OnHead, Person, King, and Crown
Function symbols, which stand for functions. Example: LeftLeg
Symbols will begin with uppercase letters.
Interpretation: The semantics must relate sentences to models in order to determine truth. For
this to happen, we need an interpretation that specifies exactly which objects, relations and
functions are referred to by the constant, predicate, and function symbols.
For example:
Richard refers to Richard the Lionheart and John refers to the evil King John. Brother refers
to the brotherhood relation. OnHead refers to the "on head" relation that holds between the
crown and King John. Person, King, and Crown refer to the sets of objects that are persons,
kings, and crowns. LeftLeg refers to the "left leg" function.
The truth of any sentence is determined by a model and an interpretation for the
sentence's symbols. Therefore, entailment, validity, and so on are defined in terms of all
possible models and all possible interpretations. The number of domain elements in each
model may be unbounded; for example, the domain elements may be integers or real
numbers. Hence, the number of possible models is unbounded, as is the number of
interpretations.
A term is a logical expression that refers to an object. Constant symbols are therefore
terms.
Complex terms: A complex term is just a complicated kind of name. A complex
term is formed by a function symbol followed by a parenthesized list of terms as arguments
to the function symbol. For example, for "King John's left leg", instead of using a constant
symbol, we use LeftLeg(John).
The formal semantics of terms: Consider a term f(t1, ..., tn). The function
symbol f refers to some function in the model (call it F); the argument terms refer to objects
in the domain (call them d1, ..., dn); and the term as a whole refers to the object that is the
value of the function F applied to d1, ..., dn. For example, if the LeftLeg function symbol
refers to the function "(King John) → John's left leg" and John refers to King John, then
LeftLeg(John) refers to King John's left leg. In this way, the interpretation fixes the
referent of every term.
Atomic sentences
Complex sentences: Complex sentences can be constructed using logical connectives, just as in
propositional logic.
The sentence ∀x King(x) ⇒ Person(x) says, “For all x, if x is a king, then x is a person.” The
symbol x is called a variable. Variables are lowercase letters. A variable is a term all by
itself, and can also serve as the argument of a function. A term with no variables is called a
ground term.
Assume we can extend the interpretation in five different ways: x → Richard the Lionheart,
x → King John, x → Richard’s left leg, x → John’s left leg, x → the crown.
The universally quantified sentence ∀x King(x) ⇒ Person(x) is true in the original model if the
sentence King(x) ⇒ Person(x) is true under each of the five extended interpretations. That is, the
universally quantified sentence is equivalent to asserting the following five sentences:
Richard the Lionheart is a king ⇒ Richard the Lionheart is a person.
King John is a king ⇒ King John is a person.
Richard’s left leg is a king ⇒ Richard’s left leg is a person.
John’s left leg is a king ⇒ John’s left leg is a person.
The crown is a king ⇒ the crown is a person.
Universal quantification makes statements about every object. Similarly, we can make a statement
about some object in the universe without naming it, by using an existential quantifier.
The sentence ∃x P says that P is true for at least one object x. More precisely, ∃x P is true in a
given model if P is true in at least one extended interpretation that assigns x to a domain
element. ∃x is pronounced “There exists an x such that . . .” or “For some x . . .”.
For example, to say that King John has a crown on his head, we write ∃x Crown(x) ∧ OnHead(x, John).
At least one of the following assertions must then be true:
Richard the Lionheart is a crown ∧ Richard the Lionheart is on John’s head;
King John is a crown ∧ King John is on John’s head;
Richard’s left leg is a crown ∧ Richard’s left leg is on John’s head;
John’s left leg is a crown ∧ John’s left leg is on John’s head;
The crown is a crown ∧ the crown is on John’s head.
The fifth assertion is true in the model, so the original existentially quantified sentence is
true in the model. Just as ⇒ appears to be the natural connective to use with ∀, ∧ is the natural
connective to use with ∃.
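The "extended interpretation" reading of ∀ and ∃ can be checked mechanically over the five-object model. The following is a minimal sketch; the Python names for the domain elements and the exact contents of the relation sets are assumptions that mirror the example (only John is taken to be a king, as in the unary relation given earlier).

```python
# The five domain elements of the example model (identifiers are hypothetical).
DOMAIN = ["Richard", "John", "RichardLeftLeg", "JohnLeftLeg", "TheCrown"]

# Relations as sets of objects or tuples, assumed from the example.
KING = {"John"}
PERSON = {"Richard", "John"}
CROWN = {"TheCrown"}
ON_HEAD = {("TheCrown", "John")}


def implies(a, b):
    return (not a) or b


# Universal quantifier: King(x) => Person(x) must hold for every assignment of x.
forall_king_person = all(implies(x in KING, x in PERSON) for x in DOMAIN)

# Existential quantifier: Crown(x) and OnHead(x, John) must hold for some x.
exists_crown_on_john = any(
    x in CROWN and (x, "John") in ON_HEAD for x in DOMAIN
)

# Using => with the existential quantifier gives a much weaker statement:
# it is already true for any x that is not a crown.
exists_with_implication = any(
    implies(x in CROWN, (x, "John") in ON_HEAD) for x in DOMAIN
)

print(forall_king_person, exists_crown_on_john, exists_with_implication)
```

The last check illustrates why ∧, not ⇒, is the natural connective for ∃: the implication version is satisfied by Richard, who is not a crown at all.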
Nested quantifiers
For example, “Brothers are siblings” can be written as ∀x ∀y Brother(x, y) ⇒ Sibling(x, y).
Consecutive quantifiers of the same type can be written as one quantifier with several variables.
For example, to say that siblinghood is a symmetric relationship, we can write
∀x, y Sibling(x, y) ⇔ Sibling(y, x).
For example: 1. “Everybody loves somebody” means that for every person, there is someone that
person loves: ∀x ∃y Loves(x, y). 2. On the other hand, to say “There is someone who is loved by
everyone,” we write ∃y ∀x Loves(x, y).
Universal and Existential quantifiers are actually intimately connected with each other, through negation.
Example assertions:
1. “ Everyone dislikes medicine” is the same as asserting “ there does not exist someone who likes medicine” ,
and vice versa:“∀x ¬Likes(x, medicine)” is equivalent to “¬∃x Likes(x,medicine)”.
Because ∀ is really a conjunction over the universe of objects and ∃ is a disjunction, it
should not be surprising that they obey De Morgan’s rules. The De Morgan rules for quantified
and unquantified sentences are as follows:
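The rules themselves are not reproduced in these notes; in standard notation they can be written as below (quantified versions on the left, the corresponding unquantified versions on the right). P and Q stand for arbitrary sentences.

```latex
\begin{align*}
\forall x\, \neg P &\equiv \neg \exists x\, P      & \neg(P \lor Q)  &\equiv \neg P \land \neg Q \\
\neg \forall x\, P &\equiv \exists x\, \neg P      & \neg(P \land Q) &\equiv \neg P \lor \neg Q \\
\forall x\, P      &\equiv \neg \exists x\, \neg P & P \land Q       &\equiv \neg(\neg P \lor \neg Q) \\
\exists x\, P      &\equiv \neg \forall x\, \neg P & P \lor Q        &\equiv \neg(\neg P \land \neg Q)
\end{align*}
```

The first line is the rule used in the medicine example: ∀x ¬Likes(x, medicine) is equivalent to ¬∃x Likes(x, medicine).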
First-order logic includes one more way to make atomic sentences, other than using a predicate
and terms. We can use the equality symbol to signify that two terms refer to the same object.
For example,
“Father(John) = Henry” says that the object referred to by Father(John) and the object
referred to by Henry are the same.
Because an interpretation fixes the referent of any term, determining the truth of an equality
sentence is simply a matter of seeing that the referents of the two terms are the same object.
The equality symbol can be used to state facts about a given function. It can also be used with
negation to insist that two terms are not the same object.
For example,
“Richard has at least two brothers” can be written as
∃x, y Brother(x, Richard) ∧ Brother(y, Richard) ∧ ¬(x = y).
The sentence ∃x, y Brother(x, Richard) ∧ Brother(y, Richard) does not have the intended meaning.
In particular, it is true even in a model where Richard has only one brother: consider the
extended interpretation in which both x and y are assigned to King John. The addition of
¬(x = y) rules out such models.
USING FIRST ORDER LOGIC
Assertions and queries in first-order logic
Sentences are added to a knowledge base using TELL, exactly as in propositional logic. Such
sentences are called assertions.
For example,
John is a king: TELL(KB, King(John)). Richard is a person: TELL(KB, Person(Richard)). All kings
are persons: TELL(KB, ∀x King(x) ⇒ Person(x)).
Asking Queries:
We can ask questions of the knowledge base using ASK. Questions asked with ASK are called
queries or goals.
For example, ASK(KB, King(John)) returns true.
Any query that is logically entailed by the knowledge base should be answered
affirmatively. We can also ask quantified queries, such as ASK(KB, ∃x Person(x)).
The answer is true, but this is perhaps not as helpful as we would like. It is rather like
answering “Can you tell me the time?” with “Yes.”
If we want to know what value of x makes the sentence true, we will need a different
function, ASKVARS, which we call with ASKVARS(KB, Person(x)) and which yields a
stream of answers.
In this case there will be two answers: {x/John} and {x/Richard}. Such an answer is called a
substitution or binding list.
ASKVARS is usually reserved for knowledge bases consisting solely of Horn clauses,
because in such knowledge bases every way of making the query true will bind the variables
to specific values.
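The TELL / ASK / ASKVARS interface can be illustrated with a toy knowledge base. This is a deliberately tiny sketch, not a general first-order reasoner: it only supports unary facts such as King(John) and rules of the form ∀x P(x) ⇒ Q(x), and the class and method names (TinyKB, tell_fact, tell_rule, ask, ask_vars) are hypothetical.

```python
class TinyKB:
    """Toy KB: unary ground facts plus rules 'forall x P(x) => Q(x)'."""

    def __init__(self):
        self.facts = set()   # {(predicate, constant)}
        self.rules = []      # [(premise_predicate, conclusion_predicate)]

    def tell_fact(self, predicate, constant):
        self.facts.add((predicate, constant))

    def tell_rule(self, premise, conclusion):
        self.rules.append((premise, conclusion))

    def _closure(self):
        """All facts entailed by the stored facts and rules."""
        derived = set(self.facts)
        changed = True
        while changed:
            changed = False
            for premise, conclusion in self.rules:
                for pred, const in list(derived):
                    if pred == premise and (conclusion, const) not in derived:
                        derived.add((conclusion, const))
                        changed = True
        return derived

    def ask(self, predicate, constant):
        """ASK: is the ground query entailed?"""
        return (predicate, constant) in self._closure()

    def ask_vars(self, predicate):
        """ASKVARS: yield one substitution {x/constant} per answer."""
        for pred, const in sorted(self._closure()):
            if pred == predicate:
                yield {"x": const}


kb = TinyKB()
kb.tell_fact("King", "John")        # TELL(KB, King(John))
kb.tell_fact("Person", "Richard")   # TELL(KB, Person(Richard))
kb.tell_rule("King", "Person")      # TELL(KB, forall x King(x) => Person(x))

answers = list(kb.ask_vars("Person"))
print(kb.ask("King", "John"), answers)
```

Here ask plays the role of ASK with a ground query, and ask_vars returns the two binding lists {x/John} and {x/Richard} mentioned above.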
We use functions for Mother and Father, because every person has exactly one of each of these.
We can represent each function and predicate, writing down what we know in terms of the other
symbols.
For example:
1. One’s mother is one’s female parent: ∀m, c Mother(c) = m ⇔ Female(m) ∧ Parent(m, c).
2. One’s husband is one’s male spouse: ∀w, h Husband(h, w) ⇔ Male(h) ∧ Spouse(h, w).
4. Parent and child are inverse relations: ∀p, c Parent(p, c) ⇔ Child(c, p).
Each of these sentences can be viewed as an axiom of the kinship domain. Axioms are commonly
associated with purely mathematical domains. They provide the basic factual information
from which useful conclusions can be derived.
Kinship axioms are also definitions; they have the form ∀x, y P(x, y) ⇔ . . ..
The axioms define the Mother function and the Husband, Male, Parent, Grandparent, and Sibling
predicates in terms of other predicates.
Our definitions “bottom out” at a basic set of predicates (Child, Spouse, and Female) in terms of
which the others are ultimately defined. This is a natural way in which to build up the
representation of a domain, and it is analogous to the way in which software packages
are built up by successive definitions of subroutines from primitive library functions.
Not all logical sentences about a domain are axioms. Some are theorems—that is, they are
entailed by the axioms.
For example, consider the assertion that siblinghood is symmetric: ∀x, y Sibling(x, y) ⇔Sibling(y, x) .
It is a theorem that follows logically from the axiom that defines siblinghood. If we ASK the
knowledge base this sentence, it should return true. From a purely logical point of view, a
knowledge base need contain only axioms and no theorems, because the theorems do not
increase the set of conclusions that follow from the knowledge base. From a practical point of
view, theorems are essential to reduce the computational cost of deriving new sentences.
Without them, a reasoning system has to start from first principles every time.
Not all axioms are definitions. Some provide more general information about certain predicates
without constituting a definition. For example, there is no obvious way to complete the sentence
∀x Person(x) ⇔ . . .
Fortunately, first-order logic allows us to make use of the Person predicate without completely
defining it. Instead, we can write partial specifications of properties that every person has and
properties that make something a person:
∀x Person(x) ⇒ . . .   ∀x . . . ⇒ Person(x).
Axioms can also be “just plain facts,” such as Male(Jim) and Spouse(Jim, Laura). Such facts form
the descriptions of specific problem instances, enabling specific questions to be answered.
The answers to these questions will then be theorems that follow from the axioms.
Numbers are perhaps the most vivid example of how a large theory can be built up from
a tiny kernel of axioms. We describe here the theory of natural numbers or
non-negative integers. We need: a predicate NatNum that is true of natural numbers;
one constant symbol, 0; and one function symbol, S (successor). The Peano
axioms define natural numbers and addition:
NatNum(0). ∀n NatNum(n) ⇒ NatNum(S(n)).
That is, 0 is a natural number, and for every object n, if n is a natural number, then S(n) is
a natural number.
So the natural numbers are 0, S(0), S(S(0)), and so on. We also need axioms to constrain
the successor function: ∀n 0 ≠ S(n). ∀m, n m ≠ n ⇒ S(m) ≠ S(n).
Now we can define addition in terms of the successor function: ∀m NatNum(m) ⇒ +(0, m) = m.
∀m, n NatNum(m) ∧ NatNum(n) ⇒ +(S(m), n) = S(+(m, n)).
The first of these axioms says that adding 0 to any natural number m gives m itself.
Addition is represented using the binary function symbol “+” in the term +(0, m).
To make our sentences about numbers easier to read, we allow the use of infix notation. We
can also write S(n) as n + 1, so the second axiom becomes:
∀m, n NatNum(m) ∧ NatNum(n) ⇒ (m + 1) + n = (m + n) + 1.
This axiom reduces addition to repeated application of the successor function. Once we have
addition, it is straightforward to define multiplication as repeated addition, exponentiation as
repeated multiplication, integer division and remainders, prime numbers, and so on. Thus, the
whole of number theory (including cryptography) can be built up from one constant, one
function, one predicate and four axioms.
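The way the two addition axioms reduce addition to repeated application of S can be sketched directly. The representation below (0 as the empty tuple, S(n) as a one-element tuple wrapping n) is just an illustrative encoding chosen for this sketch.

```python
ZERO = ()  # the constant symbol 0


def S(n):
    """Successor function: wraps n one level deeper."""
    return (n,)


def add(m, n):
    """Addition defined only by the two Peano addition axioms."""
    if m == ZERO:
        return n                # +(0, n) = n
    return S(add(m[0], n))      # +(S(k), n) = S(+(k, n)), where m = S(k)


def to_int(n):
    """Count how many times S was applied (for display only)."""
    count = 0
    while n != ZERO:
        n = n[0]
        count += 1
    return count


two = S(S(ZERO))
three = S(two)
five = add(two, three)
print(to_int(five))
```

Each recursive call peels one S off the first argument, so add(two, three) applies S exactly twice to three.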
We want to be able to tell whether an element is a member of a set, and to distinguish sets
from objects that are not sets.
The empty set is a constant written as { }. There is one unary predicate, Set, which is true of
sets. The binary predicates are x ∈ s (x is a member of set s) and s1 ⊆ s2 (set s1 is a subset
of set s2). The binary functions are
s1 ∩ s2 (the intersection of two sets), s1 ∪ s2 (the union of two sets), and {x|s} (the set
resulting from adjoining element x to set s).
Forward Chaining and backward chaining in AI
Inference engine:
The inference engine is the component of an intelligent system in artificial intelligence
that applies logical rules to the knowledge base to infer new information from known facts.
The first inference engine was part of an expert system. An inference engine commonly proceeds
in two modes, which are:
a. Forward chaining
b. Backward chaining
Horn Clause
Horn clauses and definite clauses are forms of sentences that enable a knowledge base
to use a more restricted and efficient inference algorithm. Logical inference algorithms use
forward and backward chaining approaches, which require the KB to be in the form of first-order
definite clauses.
Definite clause: A clause which is a disjunction of literals with exactly one positive literal is
known as a definite clause or strict Horn clause.
Horn clause: A clause which is a disjunction of literals with at most one positive literal is
known as a Horn clause. Hence all definite clauses are Horn clauses.
A. Forward Chaining
Forward chaining is also known as a forward deduction or forward reasoning method when
using an inference engine. Forward chaining is a form of reasoning which start with atomic
sentences in the knowledge base and applies inference rules (Modus Ponens) in the forward
direction to extract more data until a goal is reached.
The Forward-chaining algorithm starts from known facts, triggers all rules whose premises are
satisfied, and add their conclusion to the known facts. This process repeats until the problem
is solved.
Properties of Forward-Chaining:
Consider the following famous example which we will use in both approaches:
In the first step, we start with the known facts and choose the sentences which do
not have implications, such as American(Robert), Enemy(A, America), Owns(A, T1), and
Missile(T1). All these facts will be represented as below.
At the second step, we look for facts that can be inferred from the available facts, that is,
rules whose premises are satisfied. The premises of Rule-(1) are not yet satisfied, so nothing
is added from it in the first iteration.
Rule-(4) is satisfied with the substitution {p/T1}, so Sells(Robert, T1, A) is added, which is
inferred from the conjunction of Rule-(2) and Rule-(3).
Rule-(6) is satisfied with the substitution {p/A}, so Hostile(A) is added, which is inferred
from Rule-(7).
At step 3, Rule-(1) is satisfied with the substitution {p/Robert, q/T1, r/A}, so
we can add Criminal(Robert), which is inferred from all the available facts. Hence we have
reached our goal.
Hence it is proved that Robert is a criminal using the forward chaining approach.
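The forward-chaining loop described above (fire every rule whose premises are satisfied, add the conclusions, repeat) can be sketched as follows. The rule list is not fully reproduced in these notes, so the four rules below are an assumption reconstructed from the step descriptions (Rule-(1), (4), (5) and (6)); atoms are tuples, and lowercase argument names are treated as variables.

```python
def is_var(term):
    """Variables are lowercase (p, q, r); constants start with uppercase."""
    return term[0].islower()


def match(pattern, fact, subst):
    """Extend subst so that pattern equals the ground fact, or return None."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    subst = dict(subst)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if subst.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return subst


def all_matches(premises, facts, subst):
    """Yield every substitution that satisfies all premises."""
    if not premises:
        yield subst
        return
    for fact in facts:
        s = match(premises[0], fact, subst)
        if s is not None:
            yield from all_matches(premises[1:], facts, s)


def forward_chain(facts, rules, goal):
    facts = set(facts)
    while goal not in facts:
        new = set()
        for premises, conclusion in rules:
            for s in all_matches(premises, facts, {}):
                derived = (conclusion[0],) + tuple(s.get(t, t) for t in conclusion[1:])
                if derived not in facts:
                    new.add(derived)
        if not new:          # nothing more can be inferred
            return False, facts
        facts |= new
    return True, facts


# Known facts from step 1.
FACTS = [("American", "Robert"), ("Enemy", "A", "America"),
         ("Owns", "A", "T1"), ("Missile", "T1")]

# Assumed rules, reconstructed from the walkthrough.
RULES = [
    ([("American", "p"), ("Weapon", "q"), ("Sells", "p", "q", "r"), ("Hostile", "r")],
     ("Criminal", "p")),                                   # Rule-(1)
    ([("Missile", "p"), ("Owns", "A", "p")], ("Sells", "Robert", "p", "A")),  # Rule-(4)
    ([("Missile", "p")], ("Weapon", "p")),                 # Rule-(5)
    ([("Enemy", "p", "America")], ("Hostile", "p")),       # Rule-(6)
]

proved, final_facts = forward_chain(FACTS, RULES, ("Criminal", "Robert"))
print(proved)
```

In this sketch the first pass adds Sells(Robert, T1, A), Weapon(T1) and Hostile(A), and the second pass adds Criminal(Robert), matching the two inference steps in the text.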
Backward Chaining:
In backward-chaining, we will use the same above example, and will rewrite all the rules.
o Enemy(p, America) →Hostile(p) .........................(6)
o Enemy (A, America).......................... (7)
o American(Robert). ......................... (8)
Backward-Chaining proof:
In backward chaining, we start with our goal predicate, which is Criminal(Robert), and then
infer further rules.
At the first step, we take the goal fact. From the goal fact, we infer other facts, and
at last we prove those facts true. Our goal fact is "Robert is Criminal," so the following is
its predicate.
At the second step, we infer other facts from the goal fact which satisfy the rules. As
we can see in Rule-(1), the goal predicate Criminal(Robert) is present with the substitution
{p/Robert}. So we add all the conjunctive facts below the first level and replace p
with Robert.
Here we can see American(Robert) is a fact, so it is proved here.
At step 3, we extract the further fact Missile(q), from which Weapon(q) is inferred, as
it satisfies Rule-(5). Weapon(q) is also true with the substitution of the constant T1 for q.
At step 4, we can infer the facts Missile(T1) and Owns(A, T1) from Sells(Robert, T1, r), which
satisfies Rule-(4) with the substitution of A in place of r. So these two statements are proved
here.
At step 5, we can infer the fact Enemy(A, America) from Hostile(A), which satisfies Rule-(6).
Hence all the statements are proved true using backward chaining.
Difference between backward chaining and forward chaining
4. Forward chaining reasoning applies a breadth-first search strategy; backward chaining
reasoning applies a depth-first search strategy.
5. Forward chaining tests for all the available rules; backward chaining only tests for a few
required rules.
variable Weather, we might have
P(Weather = Sunny) = 0.7
P(Weather = Rain) = 0.2
P(Weather = Cloudy) = 0.08
P(Weather = Snow) = 0.02
Each random variable X has a domain of possible values (x1, ..., xn) that it can take on.
• We can view proposition symbols as random variables as well, if we assume that
they have a domain (true, false).
• Thus, the expression P(Cavity) can be viewed as shorthand for P(Cavity = true).
• Similarly, P(¬Cavity) is shorthand for P(Cavity = false).
• Sometimes, we will want to talk about the probabilities of all the possible values
of a random variable. In this case, we will use an expression such as P(Weather).
• For example, we would write P(Weather) = (0.7, 0.2, 0.08, 0.02). This statement defines a
probability distribution for the random variable Weather.
• We can also use logical connectives to make more complex sentences and assign
probabilities to them.
For example, P(Cavity ∧ ¬Insured)
Conditional probability:
• Once the agent has obtained some evidence concerning the previously unknown
propositions making up the domain, prior probabilities are no longer applicable.
Instead, we use conditional or posterior probabilities, with the notation P(A|B).
• This is read as "the probability of A given that all we know is B."
• P(B|A) means "Event B given Event A"
• In other words, event A has already happened, now what is the chance of event B?
• P(B|A) is also called the "Conditional Probability" of B given A.
Ex:Drawing 2 Kings from a Deck
• Event A is drawing a King first, and Event B is drawing a King second.
• For the first card the chance of drawing a King is 4 out of 52 (there are 4 Kings in a
deck of 52 cards):
• P(A) = 4/52
• But after removing a King from the deck, the 2nd card drawn is
less likely to be a King (only 3 of the 51 cards left are Kings):
• P(B|A) = 3/51
And so: P(A and B) = P(A) x P(B|A) = (4/52) x (3/51) = 12/2652 = 1/221
• So the chance of getting 2 Kings is 1 in 221, or about 0.45%.
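The arithmetic of the two-Kings example can be checked with exact fractions. This is a small sketch using Python's standard fractions module; the variable names are illustrative.

```python
from fractions import Fraction

p_a = Fraction(4, 52)           # P(A): first card is a King
p_b_given_a = Fraction(3, 51)   # P(B|A): second card is a King, given the first was

# Product rule: P(A and B) = P(A) * P(B|A)
p_both = p_a * p_b_given_a

print(p_both, float(p_both))
```

Using Fraction keeps the result exact (1/221) instead of a rounded decimal.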
BAYES Theorem:
• Bayes' Theorem is a way of finding a probability when we know certain other probabilities.
The formula is
P(A|B) = P(A) P(B|A) / P(B)
• Which tells us: how often A happens given that B happens, written P(A|B),
• When we know: how often B happens given that A happens, written P(B|A)
• and how likely A is on its own, written P(A)
• and how likely B is on its own, written P(B)
Example:
make smoke
We can then discover the probability of dangerous Fire when there is Smoke:
P(Fire|Smoke) = P(Fire) P(Smoke|Fire) / P(Smoke)
= 1% x 90% / 10% = 9%
to be sure.
Example 2:
But cloudy mornings are common (about 40% of days start cloudy)
We will use Rain to mean rain during the day, and Cloud to mean cloudy morning.
P(Rain|Cloud) = P(Rain) P(Cloud|Rain) / P(Cloud)
P(Rain) is
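Bayes' theorem, P(A|B) = P(A) P(B|A) / P(B), can be wrapped in a one-line function. The sketch below applies it to the fire and smoke figures that appear in the calculation above (1%, 90%, 10%); the function name bayes and its parameter names are hypothetical.

```python
def bayes(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be positive")
    return prior_a * likelihood_b_given_a / evidence_b


# P(Fire) = 1%, P(Smoke|Fire) = 90%, P(Smoke) = 10%
p_fire_given_smoke = bayes(0.01, 0.90, 0.10)
print(round(p_fire_given_smoke, 4))
```

The same function applies to the rain example once P(Rain), P(Cloud|Rain) and P(Cloud) are known.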
Arthur Samuel, a pioneer in the field of artificial intelligence and computer gaming,
coined the term “Machine Learning”. He defined machine learning as a “field of study
that gives computers the capability to learn without being explicitly programmed”. In
simple terms, Machine Learning (ML) can be explained as automating and
improving the learning process of computers based on their experience, without being
explicitly programmed for each task. The process starts with feeding
good-quality data and then training our machines (computers) by building machine
learning models using the data and different algorithms. The choice of algorithm depends
on what type of data we have and what kind of task we are trying to automate.
Example: Training of students for exams. While preparing for exams,
students do not simply cram the subject but try to learn it with complete understanding.
Before the examination, they feed their machine (brain) with a good amount of high-
quality data (questions and answers from different books, teachers’ notes, or online
video lectures). In effect, they are training their brain with inputs as well as outputs, that
is, what kind of approach or logic is needed to solve different kinds of questions. Each time
they solve practice test papers, they measure their performance (accuracy/score) by comparing
their answers with the given answer key. Gradually, the performance keeps increasing, and
they gain more confidence in the adopted approach. That is how models are actually built:
train the machine with data (both inputs and outputs are given to the model), and when the
time comes, test it on data (with inputs only) and compute the model's score by comparing its
answers with the actual outputs, which were not fed to it during training. Researchers are
working continuously to improve algorithms and techniques so that these
models perform even better.
Building models with suitable algorithms and techniques on the training set.
Testing our conceptualized model with data that was not fed to the model at the
time of training and evaluating its performance using metrics such as F1 score,
precision, and recall.
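The evaluation metrics named above can be computed from true and predicted labels with a few counts. The following is a minimal sketch; the label lists are made-up illustrative data, and the function name precision_recall_f1 is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive="sick"):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative held-out labels (not from the notes) and a model's predictions.
y_true = ["sick", "sick", "healthy", "healthy", "sick"]
y_pred = ["sick", "healthy", "healthy", "sick", "sick"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.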
Linear Algebra
Statistics and Probability
Graph theory
Programming Skills – Languages such as Python, R, MATLAB, C++, or Octave
Machine learning is programming computers to optimize a performance criterion using
example data or past experience. We have a model defined up to some parameters, and
learning is the execution of a computer program to optimize the parameters of the model
using the training data or past experience. The model may be predictive, to make predictions
in the future, or descriptive, to gain knowledge from data.
The field of study known as machine learning is concerned with the question of how to construct
computer programs that automatically improve with experience.
Definition of learning: A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves
with experience E.
Handwriting recognition learning problem
Task T : Recognizing and classifying handwritten words within images
Performance P : Percent of words correctly classified
Training experience E : A dataset of handwritten words with given classifications
A robot driving learning problem
Task T : Driving on highways using vision sensors
Performance P : Average distance traveled before an error
Training experience E : A sequence of images and steering commands recorded while
observing a human driver
Definition: A computer program which learns from experience is called a machine learning program or
simply a learning program.
Machine learning implementations are classified into four major categories, depending on the nature
of the learning “signal” or “response” available to the learning system:
A. Supervised learning:
Supervised learning is the machine learning task of learning a function that maps an input to an
output based on example input-output pairs. The given data is labeled.
Both classification and regression problems are supervised learning problems.
Example: Consider the following data regarding patients entering a clinic. The data consists of
the gender and age of the patients, and each patient is labeled as “healthy” or “sick”.
gender   age   label
M        48    sick
M        67    sick
F        53    healthy
M        49    sick
F        32    healthy
M        34    healthy
M        21    healthy
B. Unsupervised learning:
Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets
consisting of input data without labeled responses. In unsupervised learning algorithms, classification
or categorization is not included in the observations. Example: Consider the following data regarding
patients entering a clinic. The data consists of the gender and age of the patients.
gender   age
M        48
M        67
F        53
M        49
F        34
M        21
As a kind of learning, it resembles the methods humans use to figure out that certain objects or
events are from the same class, such as by observing the degree of similarity between objects. Some
recommendation systems that you find on the web in the form of marketing automation are based
on this type of learning.
C. Reinforcement learning:
Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards.
The learner is not told which actions to take, as in most forms of machine learning, but must instead
discover which actions yield the most reward by trying them. For example, consider teaching a
dog a new trick: we cannot tell it what to do or what not to do, but we can reward or punish it when it
does the right or wrong thing.
D. Semi-supervised learning:
Semi-supervised learning is an approach to machine learning that combines a small amount of labeled
data with a large amount of unlabeled data during training. The training signal is incomplete: the
training set has some (often many) of the target outputs missing. A special case of this principle is
transduction, where the entire set of problem instances is known at learning time, except that part of
the targets are missing. Semi-supervised learning falls between unsupervised learning and supervised
learning.
Supervised learning
Supervised learning, as the name indicates, involves a supervisor acting as a teacher. In supervised
learning we teach or train the machine using data that is well labelled, which means the data is already
tagged with the correct answer. The supervised learning algorithm analyses the training data (the set of
training examples); after that, the machine is given a new set of examples and uses what it learned from
the labelled data to produce the correct outcome for them.
For instance, suppose you are given a basket filled with different kinds of fruits. The first step is to train
the machine with all the different fruits one by one, like this:
If the object is rounded with a depression at the top and is red in color, then it is
labeled as Apple.
If the object is a long curving cylinder with a green-yellow color, then it is labeled as
Banana.
Now suppose that, after training, you are given a new fruit from the basket, say a banana, and
asked to identify it.
Since the machine has already learned from the previous data, this time it has to apply that knowledge. It
first examines the shape and color of the fruit, confirms the fruit name as BANANA, and puts it in the
Banana category. Thus the machine learns from the training data (the basket of fruits) and then
applies the knowledge to the test data (the new fruit).
Supervised learning is the type of machine learning in which machines are trained using well "labelled"
training data and, on the basis of that data, predict the output. Labelled data means that the input data
is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as the supervisor that teaches the
machine to predict the output correctly. It applies the same concept as a student learning under the supervision of
a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine learning
model. The aim of a supervised learning algorithm is to find a mapping function to map the input variable(x)
with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud Detection,
spam filtering, etc.
The working of supervised learning can be understood through the following example:
Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles, and
polygons. The first step is to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it is labelled as a square.
o If the given shape has three sides, then it is labelled as a triangle.
o If the given shape has six equal sides, then it is labelled as a hexagon.
After training, we test the model using the test set, and the task of the model is to identify each shape.
The machine has already been trained on all types of shapes, so when it encounters a new shape it classifies it on
the basis of its number of sides and predicts the output.
o Split the dataset into a training set, a validation set, and a test set.
o Determine the input features of the training dataset, which should carry enough information for the model
to predict the output accurately.
o Determine a suitable algorithm for the model, such as a support vector machine, a decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes a validation set, held out from the training data,
is needed to tune the control parameters.
o Evaluate the accuracy of the model on the test set. If the model predicts the correct outputs, then
the model is accurate.
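The splitting step in the list above can be sketched in a few lines of Python. This is only an illustration: the function name `split_dataset`, the 60/20/20 proportions, and the fixed random seed are assumptions chosen for the example, not values prescribed by these notes.

```python
import random


def split_dataset(rows, train=0.6, val=0.2, seed=42):
    """Shuffle the rows and cut them into training, validation, and test sets."""
    rows = list(rows)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)      # fixed seed keeps the split repeatable
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    train_set = rows[:n_train]
    val_set = rows[n_train:n_train + n_val]
    test_set = rows[n_train + n_val:]      # whatever remains is the test set
    return train_set, val_set, test_set


# Hypothetical dataset of 10 numbered examples.
train_set, val_set, test_set = split_dataset(range(10))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting matters when the rows are stored in some order (for example, sorted by label), because otherwise one class could end up only in the test set.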
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output variable. It
is used for the prediction of continuous variables, such as Weather forecasting, Market Trends, etc. Below are
some popular Regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means it takes one of a
fixed set of classes, such as Yes-No, Male-Female, True-False, etc. A typical application is spam filtering.
Popular classification algorithms include:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified nor labeled,
allowing the algorithm to act on that information without guidance. Here the task of the machine is to
group unsorted information according to similarities, patterns, and differences without any prior training on
the data.
Unlike supervised learning, no teacher is provided, which means no labelled training is given to the machine.
The machine must therefore find the hidden structure in the unlabeled data by itself.
For instance, suppose it is given an image containing both dogs and cats, which it has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot label them as ‘dogs’ and
‘cats’. But it can categorize them according to their similarities, patterns, and differences, i.e., it can
divide the picture into two parts: the first containing all the dogs and the
second containing all the cats. Nothing was learned beforehand, which means there were no
training data or examples.
It allows the model to work on its own to discover patterns and information that was previously undetected.
It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such
as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that describe large
portions of your data, such as people that buy X also tend to buy Y.
In the previous topic, we learned supervised machine learning in which models are trained using labeled data
under the supervision of training data. But there may be many cases in which we do not have labeled data and
need to find the hidden patterns from the given dataset. So, to solve such types of cases in machine learning,
we need unsupervised learning techniques.
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed
to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike
supervised learning, we have the input data but no corresponding output data. The goal of unsupervised
learning is to find the underlying structure of the dataset, group the data according to similarities, and
represent the dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different
types of cats and dogs. The algorithm is never told the labels of the dataset, which means it has no prior
idea about the features of the images. The task of the unsupervised learning algorithm is to identify the
image features on its own. It performs this task by clustering the image
dataset into groups according to the similarities between images.
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to how a human learns to think through their own experiences, which makes it
closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data.
o In real-world, we do not always have input data with the corresponding output so to solve such cases, we need
unsupervised learning.
Here, we take unlabeled input data, which means it is not categorized and the corresponding outputs
are not given. This unlabeled input data is fed to the machine learning model in order to train it.
The model first interprets the raw data to find hidden patterns, and then a suitable
algorithm, such as k-means clustering or hierarchical clustering, is applied.
The algorithm divides the data objects into groups according to the
similarities and differences between the objects.
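As an illustration of how such an algorithm groups unlabeled data, here is a minimal one-dimensional k-means sketch in Python, applied to the patient ages from the unlabeled clinic table earlier in these notes. The function name `kmeans_1d` and the choice of starting centroids (spread evenly between the smallest and largest value) are assumptions made for the example; practical implementations usually start from random points.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Cluster numbers into k groups (k >= 2) by repeated assign/update steps."""
    lo, hi = min(values), max(values)
    # Assumed initialisation: centroids spread evenly between min and max.
    centroids = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:     # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters


ages = [48, 67, 53, 49, 34, 21]            # ages from the unlabeled patient table
centroids, clusters = kmeans_1d(ages, k=2)
print(centroids, clusters)
```

No labels are supplied at any point: the two groups (younger and older patients) emerge only from the distances between the ages.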
o Clustering: Clustering is a method of grouping objects into clusters such that the objects with the most similarities
remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes them according to the presence and absence of those
commonalities.
o Association: Association rule learning is an unsupervised learning method used for finding relationships
between variables in a large database. It determines the sets of items that occur together in the dataset.
Association rules make marketing strategies more effective. For example, people who buy item X (say, bread)
also tend to purchase item Y (butter or jam). A typical example of association rule learning is Market Basket Analysis.
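The bread-and-butter rule above can be quantified with two standard measures, support and confidence. These terms are not introduced in the notes themselves, and the basket data below is made up purely for illustration; the sketch only shows how such a rule might be scored.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]
bread_support = support(baskets, {"bread"})                 # 4 of 5 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 3 of the 4 bread baskets
print(bread_support, rule_confidence)
```

A rule such as "bread implies butter" is usually considered interesting only when both its support and its confidence exceed thresholds chosen by the analyst.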
o K-means clustering
o Nearest-neighbor models (note that the k-nearest neighbors classifier, KNN, is itself a supervised method)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Advantages of Unsupervised Learning
o Unsupervised learning is used for more complex tasks as compared to supervised learning because, in
unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to labeled data.
Parameters       Supervised machine learning                  Unsupervised machine learning
Input Data       Algorithms are trained using labeled data.   Algorithms are used against data that is not labeled.
Complexity       Simpler method                               Computationally complex
No. of classes   No. of classes is known                      No. of classes is not known
Let us discuss what learning means for a machine:
A machine is said to be learning from past experience (the data fed in) with respect to some class of tasks if
its performance on a given task improves with that experience. For example, assume that a machine has to
predict whether a customer will buy a specific product, say “Antivirus”, this year or not. The machine
does this by looking at previous knowledge or past experience, i.e. the data on products that the customer has
bought every year. If the customer buys Antivirus every year, then there is a high probability that the customer is
going to buy an antivirus this year as well. This is how machine learning works at the basic conceptual level.
Supervised learning is when the model is trained on a labelled dataset. A labelled dataset is one
that has both input and output parameters. In this type of learning, both the training and validation datasets are
labelled, as shown in the figures below.
Both figures contain labelled datasets, as follows:
Figure A: It is a dataset of a shopping store that is useful in predicting whether a customer will purchase
a particular product under consideration or not based on his/ her gender, age, and salary.
Input: Gender, Age, Salary
Output: Purchased i.e. 0 or 1; 1 means yes the customer will purchase and 0 means that the customer
won’t purchase it.
Figure B: It is a Meteorological dataset that serves the purpose of predicting wind speed based on
different parameters.
Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction
Output: Wind Speed
Training the system:
While training the model, data is usually split in the ratio of 80:20, i.e. 80% as training data and the rest as
testing data. For the training data, we feed in the inputs as well as the outputs. The model learns from the
training data only. We use different machine learning algorithms to build our model. Learning means that the
model builds some logic of its own.
Once the model is ready, it can be tested. At the time of testing, the inputs are fed from the
remaining 20% of the data, which the model has never seen before. The model predicts some value, and we
compare it with the actual output and calculate the accuracy.
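The train-then-test procedure above might look as follows in Python. This is a rough sketch under several assumptions: the dataset is synthetic (a customer is labelled 1 if aged 40 or over), the "model" is a simple 1-nearest-neighbour rule swapped in only to keep the example short, and all function names are hypothetical.

```python
import random


def split_80_20(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and return (training data, testing data)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def predict_1nn(train_rows, x):
    """Predict the label of the training example whose input is closest to x."""
    _, label = min(train_rows, key=lambda row: abs(row[0] - x))
    return label


def accuracy(train_rows, test_rows):
    """Fraction of test examples whose predicted label matches the actual label."""
    correct = sum(1 for x, actual in test_rows
                  if predict_1nn(train_rows, x) == actual)
    return correct / len(test_rows)


# Synthetic labelled data: (age, purchased) with purchased = 1 for age >= 40.
data = [(age, 1 if age >= 40 else 0) for age in range(18, 68, 2)]
train_rows, test_rows = split_80_20(data)
score = accuracy(train_rows, test_rows)    # actual outputs are used only for scoring
print(len(train_rows), len(test_rows), score)
```

The test labels are never shown to the model while it predicts; they are used only afterwards to score the predictions.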
A. Classification: It is a supervised learning task where the output has defined labels (discrete values). For
example, in Figure A above, the output Purchased has defined labels, i.e. 0 or 1; 1 means the customer will
purchase, and 0 means that the customer won’t purchase. The goal here is to predict discrete values
belonging to a particular class, and the predictions are evaluated on the basis of accuracy.
Classification can be either binary or multi-class. In binary classification, the model predicts either 0 or 1
(yes or no), but in multi-class classification, the model chooses among more than two
classes. Example: Gmail classifies mails into more than one class, such as social, promotions, updates, and forums.
B. Regression: It is a supervised learning task where the output has a continuous value.
For example, in Figure B above, the output Wind Speed does not have discrete values but is continuous in a
particular range. The goal here is to predict a value as close to the actual output value as the model
can, and evaluation is done by calculating the error value. The smaller the error, the better
our regression model.
Unsupervised Learning:
Unsupervised machine learning analyzes and clusters unlabeled datasets using machine learning
algorithms. These algorithms find hidden patterns in data without any human intervention, i.e., we don’t
give outputs to our model. The model is trained on input parameter values only and discovers the groups or
patterns on its own. The dataset in Figure A is mall data that contains information about the clients who
subscribe to the mall. Once subscribed, they are given a membership card, and the mall has complete
information about each customer and his/her every purchase. Using this data and unsupervised learning
techniques, the mall can easily group clients based on the parameters we feed in.
Types of Unsupervised Learning are as follows:
Clustering: Broadly this technique is applied to group data based on different patterns, such as
similarities or differences, our machine model finds. These algorithms are used to process raw,
unclassified data objects into groups. For example, in the above figure, we have not given output
parameter values, so this technique will be used to group clients based on the input parameters
provided by our data.
Association: This is a rule-based ML technique that finds useful relations
between the parameters of a large data set. It is mainly used for market basket analysis, which
helps to better understand the relationship between different products. For example, shopping stores use
algorithms based on this technique to find the relationship between the sales of one product and
another’s sales based on customer behavior. For instance, if a customer buys milk, then he may also buy bread,
eggs, or butter. Once trained well, such models can be used to increase their sales by planning different
Some algorithms: K-Means Clustering
DBSCAN – Density-Based Spatial Clustering of Applications with Noise
BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies
Hierarchical Clustering
Supervised and unsupervised learning are two techniques of machine learning, but they
are used in different scenarios and with different datasets. Both learning methods are
explained below, along with a table of their differences.
Supervised Machine Learning:
Supervised learning is a machine learning method in which models are trained using labeled data. In
supervised learning, models need to find the mapping function to map the input variable (X) with the
output variable (Y).
Supervised learning needs supervision to train the model, similar to how a student learns
in the presence of a teacher. Supervised learning can be used for two types of
problems: Classification and Regression.
Example: Suppose we have an image of different types of fruits. The task of our supervised learning
model is to identify the fruits and classify them accordingly. To identify the image in supervised
learning, we give the input data as well as the output for it, which means we train the model on
the shape, size, color, and taste of each fruit. Once the training is completed, we test the model by
giving it a new set of fruit. The model identifies the fruit and predicts the output using a suitable
algorithm.
Unsupervised Machine Learning:
Unsupervised learning is another machine learning method, in which patterns are inferred from the
unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the
input data. Unsupervised learning does not need any supervision. Instead, it finds patterns in the
data on its own.
Unsupervised learning can be used for two types of problems: Clustering and Association.
Example: To understand the unsupervised learning, we will use the example given above. So unlike
supervised learning, here we will not provide any supervision to the model. We will just provide the
input dataset to the model and allow the model to find the patterns from the data. With the help of a
suitable algorithm, the model will train itself and divide the fruits into different groups according to the
most similar features between them.
The main differences between supervised and unsupervised learning are given below:
1. Supervised: The model takes direct feedback to check whether it is predicting the correct output.
   Unsupervised: The model does not take any feedback.
2. Supervised: The model predicts the output.
   Unsupervised: The model finds the hidden patterns in data.
3. Supervised: Input data is provided to the model along with the output.
   Unsupervised: Only input data is provided to the model.
4. Supervised: The goal is to train the model so that it can predict the output when it is given new data.
   Unsupervised: The goal is to find the hidden patterns and useful insights from the unknown dataset.
5. Supervised: Needs supervision to train the model.
   Unsupervised: Does not need any supervision to train the model.
6. Supervised: Can be used for those cases where we know the input as well as the corresponding outputs.
   Unsupervised: Can be used for those cases where we have only input data and no corresponding output data.
7. Supervised: The model produces an accurate result.
   Unsupervised: The model may give a less accurate result as compared to supervised learning.
8. Supervised: Not close to true Artificial Intelligence, since we first train the model on each item of data,
   and only then can it predict the correct output.
   Unsupervised: Closer to true Artificial Intelligence, as it learns in the way a child learns daily routine
   things from experience.
Semi-supervised Learning:
As the name suggests, semi-supervised learning lies between supervised and unsupervised techniques. We use these
techniques when we are dealing with data of which a small part is labeled and the remaining large portion is
unlabeled. We can use unsupervised techniques to predict labels and then feed these labels to
supervised techniques. This approach is mostly applicable to image data sets, where usually not all
images are labeled.
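One simple way to picture the idea of predicting the missing labels is sketched below in Python. It is an assumption-laden illustration, not the method of these notes: in place of a full clustering step it swaps in a nearest-class-mean rule, the data is one-dimensional and made up, and the function names are hypothetical.

```python
def class_means(labeled):
    """Mean input value of each class, computed from the few labeled examples."""
    sums, counts = {}, {}
    for x, label in labeled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def pseudo_label(labeled, unlabeled):
    """Guess a label for each unlabeled value and return the enlarged training set."""
    means = class_means(labeled)
    guessed = [(x, min(means, key=lambda label: abs(x - means[label])))
               for x in unlabeled]
    return labeled + guessed


# Hypothetical data: three labeled examples and four unlabeled ones.
labeled = [(1.0, "A"), (2.0, "A"), (9.0, "B")]
unlabeled = [1.2, 3.0, 8.0, 10.0]
enlarged = pseudo_label(labeled, unlabeled)
print(enlarged)
```

The enlarged set could then be handed to any supervised algorithm, keeping in mind that the guessed labels may be wrong and are less trustworthy than the original ones.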
Reinforcement Learning:
In this technique, the model keeps improving its performance by using reward feedback to learn the
behavior or pattern. These algorithms are specific to a particular problem, e.g. the Google self-driving car, or
AlphaGo, where a bot competes with humans and even with itself to become a better and better player of the game of
Go. Each time we feed in data, the model learns and adds the data to its knowledge, which is its training data. So,
the more it learns, the better trained and hence more experienced it becomes.
Agents observe input.
An agent performs an action by making some decisions.
After the action, the agent receives a reward and is reinforced accordingly, and the model stores the
resulting state-action pair information.
Temporal Difference (TD)
Deep Adversarial Networks
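The observe, act, and reward cycle described above can be illustrated with tabular Q-learning, one temporal-difference method. The sketch below is only an example under stated assumptions: the environment (a corridor of five cells with a reward of 1 at the right-hand end), the learning rate, discount factor, exploration rate, and all names are hypothetical choices, not taken from these notes.

```python
import random

N_STATES = 5                  # cells 0..4; cell 4 is the goal
ACTIONS = (-1, 1)             # move left or move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3   # assumed learning rate, discount, exploration


def step(state, action):
    """Environment: return (next_state, reward, done) for one move."""
    nxt = min(max(state + action, 0), N_STATES - 1)   # walls at both ends
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    """Learn a table of state-action values from reward feedback."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:                # explore
                action = rng.choice(ACTIONS)
            else:                                     # exploit, ties broken at random
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            nxt, reward, done = step(state, action)
            future = 0.0 if done else GAMMA * max(q[(nxt, a)] for a in ACTIONS)
            # Temporal-difference update of the stored state-action value.
            q[(state, action)] += ALPHA * (reward + future - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q):
    """Best known action in every non-goal state."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]


policy = greedy_policy(train())
print(policy)
```

The agent is never told that moving right is correct; it discovers this because only sequences of moves that reach the goal are rewarded.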
Supervised Learning:
Regression: Linear Regression, multi linear regression, Polynomial Regression, logistic regression, Non-linear
Regression, Model evaluation methods. Classification: – support vector machines ( SVM) , Naïve Bayes
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which runs various advertisements every year and earns sales
from them. The list below shows the advertising spend of the company in the last 5 years and the
corresponding sales:
Now, the company wants to spend $200 on advertising in the year 2019 and wants a prediction
of the sales for that year. To solve such prediction problems in machine learning, we need
regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and
enables us to predict a continuous output variable based on one or more predictor variables. It is mainly
used for prediction, forecasting, time series modeling, and determining the cause-effect relationship
between variables.
In regression, we fit a line or curve to the given datapoints on a graph of the variables; using this fit, the
machine learning model can make predictions about the data. In simple words, regression finds a line or
curve that passes as close as possible to the datapoints on the target-predictor graph, so that the vertical
distance between the datapoints and the regression line is minimal. The distance between the datapoints and the
line tells whether a model has captured a strong relationship or not.
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor, the least important
factor, and how each factor is affecting the other factors.
Types of Regression
There are various types of regressions which are used in data science and machine learning. Each type has its
own importance on different scenarios, but at the core, all the regression methods analyze the effect of the
independent variable on dependent variables. Here we are discussing some important types of regression
which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the relationship between
the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent
variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear regression. And if there
is more than one input variable, then such linear regression is called multiple linear regression.
o The relationship between the variables in a linear regression model can be illustrated by
predicting the salary of an employee on the basis of years of experience.
o Below is the mathematical equation for linear regression:
Y = aX + b
where X is the independent variable, Y is the dependent variable, a is the slope of the line, and b is its intercept.
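To make the equation Y = aX + b concrete, the following Python sketch fits a and b to a handful of points using the standard least-squares formulas. The salary figures are invented for illustration (they are not from these notes), and the function name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b in Y = aX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x            # the fitted line passes through the mean point
    return a, b


# Hypothetical data: years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]
a, b = fit_line(years, salary)
predicted = a * 6 + b                  # predicted salary for 6 years of experience
print(a, b, predicted)
```

Because the invented points lie exactly on a straight line, the fit recovers a slope of 5 and an intercept of 25; real data would scatter around the fitted line.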
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the classification problems.
In classification problems, we have dependent variables in a binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or False, Spam or
not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in how
it is used.
o Logistic regression uses the sigmoid function (also called the logistic function) to model the data. The sigmoid
function maps any real-valued input to a value between 0 and 1 and can be written as:
f(x) = \frac{1}{1 + e^{-x}}
o When we provide the input values (data) to the function, it gives an S-shaped curve.
o It uses the concept of a threshold: values above the threshold are rounded up to 1, and values below
the threshold are rounded down to 0.
o Binary(0/1, pass/fail)
o Multi(cats, dogs, lions)
o Ordinal(low, medium, high)
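The sigmoid function and the threshold rule used by logistic regression can be sketched in Python as follows. The sigmoid is written in its standard form 1 / (1 + e^(-z)); the threshold of 0.5 and the function names are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1 (the S-shaped curve)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form for negative z that avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(z, threshold=0.5):
    """Turn the sigmoid output into a 0/1 class label using a threshold."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-4.0, 0.0, 4.0):
    print(z, sigmoid(z), predict_class(z))
```

In a logistic regression model, z would be a weighted sum of the input features, so the sigmoid output can be read as the estimated probability of class 1.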
Polynomial Regression:
o Polynomial Regression is a type of regression which models a non-linear dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and the corresponding
conditional values of y.
o Suppose there is a dataset whose datapoints are arranged in a non-linear fashion. In such a
case, linear regression will not fit those datapoints well. To cover such datapoints, we need Polynomial
regression.
o In Polynomial regression, the original features are transformed into polynomial features of a given degree and
then modeled using a linear model, which means the datapoints are best fitted using a polynomial curve.
o The equation for polynomial regression is derived from the linear regression equation: the linear
regression equation Y = b_0 + b_1 x is transformed into the polynomial regression equation
Y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_n x^n
o Here Y is the predicted/target output, b_0, b_1, ..., b_n are the regression coefficients, and x is our independent/input
variable.
o The model is still linear because it is linear in the coefficients, even though the features are quadratic or of higher degree.
Note: This differs from multiple linear regression in that, in polynomial regression, a
single variable appears with different degrees instead of multiple variables with the same degree.
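The idea of transforming x into polynomial features and then fitting an ordinary linear model can be sketched in plain Python. This is a small illustration, not a production method: it solves the least-squares normal equations with a hand-written Gaussian elimination, the data is generated from a known quadratic purely for demonstration, and all function names are hypothetical.

```python
def polynomial_features(x, degree):
    """Expand a single input x into [1, x, x^2, ..., x^degree]."""
    return [x ** p for p in range(degree + 1)]


def solve_linear_system(a, b):
    """Solve a * coeffs = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):         # back substitution
        known = sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r][n] - known) / m[r][r]
    return coeffs


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [b0, b1, ..., bn] of a degree-n polynomial."""
    rows = [polynomial_features(x, degree) for x in xs]
    k = degree + 1
    # Normal equations: (X^T X) b = X^T y, where X holds the polynomial features.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Demonstration data generated from Y = 1 + 2x + 3x^2.
xs = [-2, -1, 0, 1, 2]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
b0, b1, b2 = fit_polynomial(xs, ys, degree=2)
print(b0, b1, b2)
```

The fitting step is exactly a linear regression on the expanded features, which is why the model counts as linear even though the resulting curve is not a straight line.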
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression as well as
classification problems. So if we use it for regression problems, then it is termed as Support Vector Regression.
Support Vector Regression is a regression algorithm which works for continuous variables. Below are some
keywords which are used in Support Vector Regression:
o Kernel: A function used to map lower-dimensional data into higher-dimensional data.
o Hyperplane: In a general SVM, it is the separation line between two classes, but in SVR, it is the line which helps to
predict the continuous variable and covers most of the datapoints.
o Boundary line: Boundary lines are the two lines on either side of the hyperplane, which create a margin for the datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane; in classification,
these are the closest points of the opposite classes.
In SVR, we try to determine a hyperplane with a maximum margin, so that the maximum number of
datapoints is covered by that margin. The main goal of SVR is to include as many datapoints as possible within
the boundary lines, and the hyperplane (best-fit line) must fit a maximum number of datapoints.
Consider the below image:
Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
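The role of the boundary lines can be illustrated with the epsilon-insensitive loss commonly used to formulate SVR: points that fall between the two boundary lines cost nothing, and points outside are penalised by their distance to the nearer boundary. The notes do not name this loss, so treat the sketch below as an assumption; the line, the margin width `epsilon`, and the data are all invented for the example, and no actual SVR training is performed.

```python
def epsilon_insensitive_loss(xs, ys, w, b, epsilon):
    """Return (total loss, number of points inside the margin) for the line y = w*x + b."""
    total = 0.0
    inside = 0
    for x, y in zip(xs, ys):
        residual = abs(y - (w * x + b))    # vertical distance to the hyperplane
        if residual <= epsilon:
            inside += 1                    # within the boundary lines: no penalty
        else:
            total += residual - epsilon    # penalise only the part beyond the boundary
    return total, inside


# Hypothetical data and a hand-picked line y = 2x with margin half-width 0.2.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.5, 8.0]
loss, covered = epsilon_insensitive_loss(xs, ys, w=2.0, b=0.0, epsilon=0.2)
print(loss, covered)
```

Three of the four points lie inside the margin, and only the third point, which is 0.5 away from the line, contributes to the loss.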
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method
that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric
variables such as sales, salary, age, product price, etc.
Linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence the name linear regression. Because the relationship is
linear, the model finds how the value of the dependent variable changes according to the value
of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Mathematically, a linear regression can be represented as:
y = a0 + a1x + ε
where y is the dependent (target) variable, x is the independent (predictor) variable, a0 is the intercept of the
line, a1 is the linear regression coefficient (slope), and ε is the random error.
The values of the x and y variables are the training data for the linear regression model.
Types of Linear Regression
Linear regression can be further divided into two types of algorithm:
o Simple Linear Regression: If a single independent variable is used to predict the value of a numerical
dependent variable, then the algorithm is called Simple Linear Regression.
o Multiple Linear Regression: If more than one independent variable is used to predict the value of a
numerical dependent variable, then the algorithm is called Multiple Linear Regression.
A regression line can show two types of relationship:
o Positive linear relationship: If the dependent variable increases on the Y-axis as the independent variable
increases on the X-axis, the relationship is termed a positive linear relationship.
o Negative linear relationship: If the dependent variable decreases on the Y-axis as the independent variable
increases on the X-axis, the relationship is called a negative linear relationship.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we
need to calculate the best values for a0 and a1 to find the best-fit line. To calculate these, we use a cost function.
Cost function-
o The cost function is used to estimate the values of the coefficients (a0, a1) for the best-fit line.
o The cost function is used to optimize the regression coefficients or weights. It measures how well a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the
output variable. This mapping function is also known as the hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared
errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) Σ (y_i - (a0 + a1 x_i))^2, summed over i = 1, ..., N
where N is the total number of observations, y_i is the actual value, and (a0 + a1 x_i) is the predicted value.
Residuals: The distance between the actual value and the predicted value is called the residual. If the observed
points are far from the regression line, the residuals will be large, and so the cost function will be high. If the
scatter points are close to the regression line, the residuals will be small, and hence the cost function will be small.
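A minimal sketch of fitting a0 and a1 and evaluating the MSE cost function is given below. It assumes the standard closed-form least-squares solution for simple linear regression (the notes do not state how the coefficients are optimized), and the function names and sample data are illustrative only.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a0, a1) minimising the mean squared error for y = a0 + a1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares slope: covariance(x, y) / variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1

def mean_squared_error(xs, ys, a0, a1):
    """Average of the squared residuals (actual minus predicted)."""
    residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals) / len(residuals)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # hypothetical data lying exactly on y = 1 + 2x
a0, a1 = fit_simple_linear_regression(xs, ys)
mse_best = mean_squared_error(xs, ys, a0, a1)      # cost of the best-fit line
mse_other = mean_squared_error(xs, ys, 0.0, 2.0)   # cost of a worse line
print(a0, a1, mse_best, mse_other)
```

The worse line has larger residuals and therefore a higher cost, which is the behaviour the residual discussion describes.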
Unsupervised learning
Nearest neighbor models – K-means – clustering around medoids– silhouettes – hierarchical clustering –
k-d trees ,Clustering trees – learning ordered rule lists – learning unordered rule .
Reinforcement learning- Example: Getting Lost -State and Action Spaces
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of
these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With
the help of K-NN, we can easily identify the category or class of a particular data point.
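The working of K-NN can be explained with the following steps:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category that has the largest number of neighbors.
o Step-6: Our model is ready.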
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors; here we choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. For two points A = (X1, Y1) and
B = (X2, Y2), it can be calculated as:
d(A, B) = sqrt((X2 - X1)^2 + (Y2 - Y1)^2)
o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in
category A and two nearest neighbors in category B.
o As the majority of the nearest neighbors (3 out of 5) are from category A, this new data point must belong to
category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values to find the best
out of them. A commonly used starting value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive to outliers.
o Large values for K smooth out noise, but they increase the computation and can blur the boundaries between categories.
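The K-NN procedure described above (choose K, compute Euclidean distances, take the K nearest neighbors, and assign the majority category) can be sketched as follows. The training points, the labels, and the function name `knn_classify` are hypothetical and chosen so that the new point has three neighbors in category A and two in category B.

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=5):
    """train is a list of ((x, y), label) pairs; returns (majority label, vote counts)."""
    # Sort the training points by Euclidean distance to the new point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    # Keep the labels of the k nearest neighbors and count them per category.
    nearest_labels = [label for _, label in by_distance[:k]]
    votes = Counter(nearest_labels)
    return votes.most_common(1)[0][0], votes

train = [((1, 1), "A"), ((2, 2), "A"), ((2, 3), "A"), ((0, 0), "A"),
         ((4, 3), "B"), ((4, 4), "B"), ((7, 6), "B"), ((8, 8), "B")]
label, votes = knn_classify(train, new_point=(3, 3), k=5)
print(label, dict(votes))
```

Trying several values of `k` on held-out data is one simple way to follow the advice on selecting K.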
K-means clustering is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that
each data point belongs to only one group that has similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups
in the unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data points and the centroids of their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats
the process until the cluster assignments stop changing. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are near a particular k-center
form a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
The working of the algorithm can be summarized in the following steps:
o Step-1: Select the number K to decide the number of clusters.
o Step-2: Select random K points or centroids. (They need not be points from the input dataset.)
o Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
o Step-4: Calculate the variance and place a new centroid for each cluster.
o Step-5: Repeat the third step, which means reassign each datapoint to the new closest centroid of
each cluster.
o Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.
o Step-7: The model is ready.
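The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the initial centroids are passed in explicitly instead of being chosen at random, each new centroid is placed at the mean of its cluster (an assumption about how Step-4 positions it), and the function name `k_means` and the sample points are hypothetical.

```python
import math

def k_means(points, initial_centroids, max_iter=100):
    """K-means on 2-D points; returns (final centroids, cluster index of each point)."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Step-3 / Step-5: assign every point to its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step-6: stop when no point changed its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step-4: move each centroid to the mean of the points assigned to it.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignment

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, assignment = k_means(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids, assignment)
```

Here K = 2 and the two starting centroids are not part of the dataset, which matches the option mentioned in Step-2.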
o Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
o Let's take the number of clusters K=2, to identify the dataset and to put the points into different clusters.
It means here we will try to group the data points into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be either
points from the dataset or any other points. So, here we are selecting the below two points as k
points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying the mathematics that we have studied to calculate the distance between two
points. So, we will draw a median line between the two centroids (the perpendicular bisector of the segment
joining them). Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the
points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear
visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To
choose the new centroids, we will compute the center of gravity of the data points in each cluster, and will find the new
centroids as below:
o Next, we will reassign each datapoint to the new centroids. For this, we will repeat the same process of
finding a median line. The median will be like the below image:
o From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the
right of the line. So, these three points will be assigned to new centroids.
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group
unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is
known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ
in how they work: in hierarchical clustering there is no requirement to predetermine the number of clusters as we did in
the K-Means algorithm.
The hierarchical clustering technique has two approaches:
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts by taking all data points
as single clusters and merges them until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.
o Step-1: Treat each data point as a single cluster. Let's say there are N data points, so the number of clusters will
also be N.
o Step-2: Take the two closest data points or clusters and merge them to form one cluster. So, there will now be N-1
clusters.
o Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2
clusters.
o Step-4: Repeat Step 3 until only one cluster is left.
o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the clusters as
per the problem.
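The agglomerative steps above can be sketched as a simple loop that keeps merging the two closest clusters. This sketch assumes single linkage (shortest distance between points of two clusters) as the closeness measure; the function names and sample points are hypothetical. The recorded merge distances correspond to the heights at which branches join in a dendrogram.

```python
import math

def single_linkage(c1, c2):
    """Shortest distance between any point of c1 and any point of c2."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, linkage=single_linkage):
    """Merge the two closest clusters until one is left; return the merge history."""
    clusters = [[p] for p in points]          # Step-1: N single-point clusters
    history = []
    while len(clusters) > 1:                  # Step-4: repeat until one cluster is left
        # Steps 2-3: find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        history.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

points = [(0, 0), (0, 1), (5, 5), (5, 7), (20, 20)]
history = agglomerative(points)
for left, right, height in history:
    print(left, "+", right, "merged at distance", round(height, 3))
```

With N points the loop performs N-1 merges, one per reduction in the number of clusters.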
The distance between two clusters can be measured in several ways, known as linkage methods:
1. Single Linkage: It is the shortest distance between the closest points of the clusters.
2. Complete Linkage: It is the farthest distance between two points of two different clusters. It is one of the
popular linkage methods as it forms tighter clusters than single linkage.
3. Average Linkage: It is the linkage method in which the distances between every pair of data points (one from each
cluster) are added up and then divided by the number of pairs to calculate the average distance between two clusters.
It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroids of the clusters is
calculated.
From the above-given approaches, we can apply any of them according to the type of problem or business
requirement.
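The four linkage methods can be compared on a small example. The sketch below assumes Euclidean distance between 2-D points; the function names and the two sample clusters are hypothetical.

```python
import math

def single_linkage(c1, c2):
    """Shortest distance between a point of c1 and a point of c2."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Farthest distance between a point of c1 and a point of c2."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    """Average distance over all pairs with one point from each cluster."""
    return sum(math.dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def centroid_linkage(c1, c2):
    """Distance between the centroids of the two clusters."""
    return math.dist(centroid(c1), centroid(c2))

a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
print(single_linkage(a, b), complete_linkage(a, b),
      average_linkage(a, b), centroid_linkage(a, b))
```

For any two clusters the average linkage lies between the single and complete linkage values, which is one reason the methods can produce different merge orders.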
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part shows how clusters are created in agglomerative clustering, and the
right part shows the corresponding dendrogram.
o Firstly, the datapoints P2 and P3 combine and form a cluster, and
correspondingly a dendrogram is created, which connects P2 and P3 with a rectangular shape. The height is
decided according to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the
previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in
another dendrogram.
o At last, the final dendrogram is created that combines all the data points together.
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset. It
can be defined as "A way of grouping the data points into different clusters, consisting of similar
data points. The objects with the possible similarities remain in a group that has less or no
similarities with another group."
It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color, behavior,
etc., and divides them as per the presence and absence of those similar patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals
with the unlabeled dataset.
Note: Clustering is somewhat similar to classification, but the difference is the
type of dataset that we are using. In classification, we work with a labeled dataset, whereas
in clustering, we work with an unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example of a mall: When we
visit any shopping mall, we can observe that things with similar usage are grouped together. For example,
t-shirts are grouped in one section and trousers in another; similarly, in the vegetable
section, apples, bananas, mangoes, etc., are grouped in separate sections, so that we can easily find
things. The clustering technique works in the same way. Another example of clustering is
grouping documents according to topic.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to provide
recommendations based on past product searches. Netflix also uses this technique to
recommend movies and web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see the different fruits
are divided into several groups with similar properties.
The clustering methods are broadly divided into the following types:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-Means
Clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K defines the number of pre-defined
groups. The cluster centers are created in such a way that the distance between the data points
of a cluster and their own centroid is minimal compared to the distance to any other cluster centroid.
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily
shaped clusters are formed as long as the dense regions can be connected. The algorithm does this
by identifying the areas of high density in the dataset and connecting them into clusters.
The dense areas in the data space are separated from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying densities
and high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that
a data point belongs to a particular distribution. The grouping is done by assuming some distribution,
most commonly the Gaussian distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that uses Gaussian
Mixture Models (GMM).
Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is no
requirement of pre-specifying the number of clusters to be created. In this technique, the dataset is
divided into clusters to create a tree-like structure, which is also called a dendrogram. The
observations or any number of clusters can be selected by cutting the tree at the correct level. The
most common example of this method is the Agglomerative Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or
cluster. Each data object has a set of membership coefficients, which give its degree of
membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is
sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
The clustering algorithms can be divided based on their models, which are explained above. Many
different clustering algorithms have been published, but only a few are commonly used. The choice of clustering
algorithm depends on the kind of data that we are using. For example, some algorithms need an estimate of the
number of clusters in the given dataset, whereas others need to find the minimum distance
between the observations of the dataset.
Here we discuss the most popular clustering algorithms that are widely used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It
classifies the dataset by dividing the samples into different clusters of equal variances. The number of
clusters must be specified in this algorithm. It is fast with fewer computations required, with the linear
complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density of data
points. It is an example of a centroid-based model, that works on updating the candidates for centroid
to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It is an
example of a density-based model similar to the mean-shift, but with some remarkable advantages. In
this algorithm, the areas of high density are separated by the areas of low density. Because of this, the
clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the
k-means algorithm, or for those cases where K-means may fail. In GMM, it is assumed that the data
points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs the bottom-
up hierarchical clustering. In this, each data point is treated as a single cluster at the outset and then
successively merged. The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require the
number of clusters to be specified. In this algorithm, messages are exchanged between pairs of data points until
convergence. It has O(N²T) time complexity (N data points, T iterations), which is the main drawback of this algorithm.
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the identification of
cancerous cells. It divides the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result appears
based on the closest object to the search query. It does it by grouping similar data objects in one group
that is far from the other dissimilar objects. The accurate result of a query depends on the quality of the
clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their choice
and preferences.
o In Biology: It is used in biology to classify different species of plants and animals using
image recognition techniques.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS
database. This can be very useful for finding the purpose for which a particular piece of land is
most suitable.
o Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in
an environment by performing actions and seeing the results of those actions. For each good action, the agent gets
positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
o In Reinforcement Learning, the agent learns automatically using feedback, without any labeled data,
unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential, and the goal is long-term, such
as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement
learning is to improve its performance by collecting the maximum positive reward.
o The agent learns by trial and error, and based on the experience, it learns to perform the task in a
better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an
intelligent agent (computer program) interacts with the environment and learns to act within that." How a
robotic dog learns the movement of its limbs is an example of reinforcement learning.
o It is a core part of Artificial Intelligence, and many AI agents work on the concept of reinforcement learning. Here
we do not need to pre-program the agent, as it learns from its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond.
The agent interacts with the environment by performing some actions, and based on those actions, the state of
the agent changes, and it also receives a reward or penalty as feedback.
o The agent keeps doing these three things (take an action, change state or remain in the same state, and get
feedback), and by doing so, it learns and explores the environment.
o The agent learns which actions lead to positive feedback or rewards and which actions lead to negative
feedback or penalties. As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
Terms used in Reinforcement Learning
o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic
environment, which means it is random in nature.
o Action: Actions are the moves taken by an agent within the environment.
o State: State is the situation returned by the environment after each action taken by the agent.
o Reward: Feedback returned to the agent from the environment to evaluate the action of the agent.
o Policy: Policy is the strategy applied by the agent to choose the next action based on the current state.
o Value: It is the expected long-term return with the discount factor, as opposed to the short-term reward.
o Q-value: It is mostly similar to the value, but it takes one additional parameter, the current action (a).
o The environment is stochastic, and the agent needs to explore it to get the maximum positive reward.
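The difference between a short-term reward and a long-term value with a discount factor can be illustrated with a small calculation. The sketch below assumes the usual definition of a discounted return, in which the reward received t steps in the future is weighted by the discount factor raised to the power t; the function name and the reward sequence are hypothetical.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each weighted by gamma raised to its time step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical episode: four steps without reward, then a reward of +1.
rewards = [0, 0, 0, 0, 1]
value = discounted_return(rewards, gamma=0.9)
print(round(value, 4))
```

The later a reward arrives, the less it contributes to the value, so an agent that maximizes value prefers shorter routes to the same reward.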
There are mainly three approaches to implementing reinforcement learning:
1. Value-based:
The value-based approach aims to find the optimal value function, which is the maximum value at a state
under any policy. The agent therefore expects the long-term return at any state s under policy π.
2. Policy-based:
The policy-based approach aims to find the optimal policy for the maximum future reward without using the value
function. In this approach, the agent tries to apply a policy such that the action performed at each step helps to
maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: For a given state, the policy (π) always produces the same action.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the environment, and the agent
explores that environment to learn it. There is no particular solution or algorithm for this approach because the
model representation is different for each environment.
There are four main elements of reinforcement learning:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
1) Policy: A policy can be defined as the way an agent behaves at a given time. It maps the perceived states
of the environment to the actions taken in those states. A policy is the core element of RL, as it alone can
define the behavior of the agent. In some cases, it may be a simple function or a lookup table, whereas in
other cases, it may involve general computation such as a search process. It can be a deterministic or a stochastic
policy:
For deterministic policy: a = π(s)
For stochastic policy: π(a | s) = P[At = a | St = s]
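The two kinds of policy can be sketched as lookup tables. This is only an illustration: the state names, the actions, and the probabilities are hypothetical, and a fixed random seed is used so that the sampled actions are reproducible.

```python
import random

# Deterministic policy: a lookup table mapping each state to one action, a = π(s).
deterministic_policy = {"s9": "up", "s5": "up", "s1": "right"}

# Stochastic policy: π(a | s) gives a probability for every action in a state.
stochastic_policy = {"s9": {"up": 0.7, "right": 0.3}}

def act_deterministic(state):
    """Always return the same action for the same state."""
    return deterministic_policy[state]

def act_stochastic(state, rng):
    """Sample an action according to the probabilities π(a | s)."""
    actions = list(stochastic_policy[state])
    weights = [stochastic_policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [act_stochastic("s9", rng) for _ in range(1000)]
print(act_deterministic("s9"), samples.count("up"), samples.count("right"))
```

Repeated calls to the deterministic policy always agree, whereas the stochastic policy returns each action roughly in proportion to its probability.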
2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state, the
environment sends an immediate signal to the learning agent, and this signal is known as the reward signal.
These rewards are given according to the good and bad actions taken by the agent. The agent's main objective
is to maximize the total reward it receives. The reward signal can change the policy: for example,
if an action selected by the agent leads to a low reward, then the policy may change to select other actions in
the future.
3) Value Function: The value function gives information about how good a situation and action are and how
much reward an agent can expect. A reward indicates the immediate signal for each good and bad action,
whereas a value function specifies the good states and actions for the future. The value function depends on
the reward as, without reward, there could be no value. The goal of estimating values is to achieve more
rewards.
4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the
environment. With the help of the model, one can make inferences about how the environment will behave.
Such as, if a state and an action are given, then a model can predict the next state and reward.
The model is used for planning, which means it provides a way to take a course of action by considering all
future situations before actually experiencing those situations. The approaches for solving the RL
problems with the help of the model are termed as the model-based approach. Comparatively, an
approach without using a model is called a model-free approach.
Let's take an example of a maze environment that the agent needs to explore. Consider the below image:
In the above image, the agent is at the very first block of the maze. The maze contains an S6 block, which
is a wall, S8, a fire pit, and S4, a diamond block.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1
reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down,
move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in as few steps as possible.
Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward point.
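The interaction between the agent and this maze can be sketched as a `step` function that takes a state and an action and returns the next state, the reward, and whether the episode has ended. The sketch assumes a layout of 3 rows by 4 columns with the blocks numbered S1 to S12 row by row from the top-left; the original figure is not reproduced here, so this layout, like the function and variable names, is an assumption (it is consistent with the path S9-S5-S1-S2-S3).

```python
ROWS, COLS = 3, 4                       # assumed grid: S1..S4 / S5..S8 / S9..S12
WALL, FIRE, DIAMOND = 6, 8, 4
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward, done) for block number `state` (1..12)."""
    row, col = divmod(state - 1, COLS)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    next_state = state                   # by default the agent stays where it is
    if 0 <= new_row < ROWS and 0 <= new_col < COLS:
        candidate = new_row * COLS + new_col + 1
        if candidate != WALL:            # the agent cannot enter the wall block
            next_state = candidate
    if next_state == DIAMOND:
        return next_state, 1, True
    if next_state == FIRE:
        return next_state, -1, True
    return next_state, 0, False

state, total_reward, visited = 9, 0, [9]
for action in ["up", "up", "right", "right", "right"]:
    state, reward, done = step(state, action)
    visited.append(state)
    total_reward += reward
print(visited, total_reward, done)
```

Following the five actions takes the agent along S9-S5-S1-S2-S3 and into the diamond block S4, where it collects the +1 reward.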
The agent will try to remember the preceding steps that it has taken to reach the final step. To memorize the
steps, it assigns the value 1 to each previous step. Consider the below step:
Now, the agent has successfully stored the previous steps by assigning the value 1 to each previous block. But
what will the agent do if it starts moving from a block that has a block with value 1 on both sides? Consider the
below diagram:
It will be difficult for the agent to decide whether it should go up or down, as each block has the same value.
So, the above approach is not suitable for the agent to reach the destination. Hence, to solve the problem, we
will use the Bellman equation, which is the main concept behind reinforcement learning.
The Bellman equation is a way of calculating the value functions in dynamic programming, and it leads to modern
reinforcement learning.
The key elements used in the Bellman equation are:
o The action performed by the agent is "a."
o The state reached by performing the action is "s."
o The reward/feedback obtained for each good and bad action is "R."
o The discount factor is Gamma "γ."
The Bellman equation can be written as:
V(s) = max [R(s,a) + γV(s')]
Where,
V(s) = Value calculated at a particular state
R(s,a) = Reward obtained at state s by performing action a
γ = Discount factor
V(s') = Value of the next state s'
In the above equation, we take the max over all possible actions because the agent always tries to find the
optimal solution.
So now, using the Bellman equation, we will find the value at each state of the given environment. We will start
from the block which is next to the target block.
V(s3) = max [R(s,a) + γV(s')], here V(s') = 0 because there is no further state to move to, and R(s,a) = 1 for
reaching the diamond block, so V(s3) = 1.
V(s2) = max [R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0 because there is no reward at this
state, so V(s2) = 0 + 0.9 × 1 = 0.9.
V(s1) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.9, and R(s,a) = 0 because there is no reward at this
state, so V(s1) = 0 + 0.9 × 0.9 = 0.81.
V(s5) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.81, and R(s,a) = 0 because there is no reward at this
state either, so V(s5) = 0 + 0.9 × 0.81 ≈ 0.73.
V(s9) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.73, and R(s,a) = 0 because there is no reward at this
state either, so V(s9) = 0 + 0.9 × 0.73 ≈ 0.66.
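The block values computed above can be reproduced by applying the Bellman equation repeatedly to every block until the values stop changing. The sketch below assumes a deterministic maze with 3 rows and 4 columns, blocks numbered S1 to S12 row by row from the top-left, S6 as the wall, S8 as the fire pit, and S4 as the diamond; this layout and all names in the code are assumptions, since the original figure is not available.

```python
GAMMA = 0.9
ROWS, COLS = 3, 4                            # assumed grid: S1..S4 / S5..S8 / S9..S12
WALL, FIRE, DIAMOND = 6, 8, 4
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def next_state(state, move):
    """Block reached from `state` by `move`; walls and edges leave the agent in place."""
    row, col = divmod(state - 1, COLS)
    new_row, new_col = row + move[0], col + move[1]
    if 0 <= new_row < ROWS and 0 <= new_col < COLS:
        candidate = new_row * COLS + new_col + 1
        if candidate != WALL:
            return candidate
    return state

def reward(state_after):
    """+1 for entering the diamond block, -1 for the fire pit, 0 otherwise."""
    return {DIAMOND: 1, FIRE: -1}.get(state_after, 0)

values = {s: 0.0 for s in range(1, ROWS * COLS + 1)}
for _ in range(50):                          # repeated Bellman backups
    for s in values:
        if s in (WALL, FIRE, DIAMOND):
            continue                         # terminal or unreachable blocks keep value 0
        # V(s) = max over actions of [R(s,a) + gamma * V(s')]
        values[s] = max(reward(next_state(s, m)) + GAMMA * values[next_state(s, m)]
                        for m in MOVES)

for s in (3, 2, 1, 5, 9):
    print("V(s%d) = %.2f" % (s, values[s]))
```

Under these assumptions the loop gives approximately 1, 0.9, 0.81, 0.73, and 0.66 for s3, s2, s1, s5, and s9, in line with the hand calculation.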
Now, we will move further to the 6th block, and here the agent may change the route because it always tries to
find the optimal path. So now, let's consider the block next to the fire pit.
Now, the agent has three options to move: if it moves to the blue box, it will feel a bump; if it moves
to the fire pit, it will get the -1 reward. But here we are taking only positive rewards, so it will
move upwards only. The complete block values will be calculated using this formula.
Types of Reinforcement learning
There are mainly two types of reinforcement learning, which are:
o Positive Reinforcement
o Negative Reinforcement
Positive Reinforcement:
Positive reinforcement learning means adding something to increase the tendency that the expected behavior
will occur again. It has a positive impact on the behavior of the agent and increases the strength of the
behavior.
This type of reinforcement can sustain the changes for a long time, but too much positive reinforcement may
lead to an overload of states that can reduce the consequences.
Negative Reinforcement:
Negative reinforcement learning is the opposite of positive reinforcement, as it increases the tendency
that the specific behavior will occur again by avoiding the negative condition.
It can be more effective than positive reinforcement depending on the situation and behavior, but it provides
reinforcement only to meet the minimum behavior.
How to represent the agent state?
We can represent the agent state using the Markov state, which contains all the required information from the
history. The state St is a Markov state if it satisfies the following condition:
P[St+1 | St] = P[St+1 | S1, S2, ..., St]
The Markov state follows the Markov property, which says that the future is independent of the past and can
be defined using only the present. RL works on fully observable environments, where the agent can
observe the environment and act for the new state. The complete process is known as the Markov Decision
Process, which is explained below:
A Markov Decision Process (MDP) is used to describe the environment for RL, and almost all RL problems can be
formalized using an MDP.
An MDP uses the Markov property, so to better understand the MDP, we need to learn about this property.
Markov Property:
It says that "If the agent is present in the current state s1, performs an action a1, and moves to the state s2,
then the state transition from s1 to s2 depends only on the current state and the action taken; it does not
depend on past actions, rewards, or states."
Or, in other words, as per the Markov property, the current state transition does not depend on any past action or
state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a game of chess, the players
only focus on the current state and do not need to remember past actions or states.
Finite MDP:
A finite MDP is one with finite states, finite rewards, and finite actions. In RL, we consider only the finite
MDP.
Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that uses the Markov
property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and a transition
function P. These two components (S and P) can define the dynamics of the system.
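A Markov chain defined by the tuple (S, P) can be simulated in a few lines. In this sketch the state names and the transition probabilities are hypothetical, and a fixed seed keeps the sampled sequence reproducible. Each next state is drawn using only the current state, which is the Markov property in practice.

```python
import random

states = ["sunny", "rainy"]              # S: the set of states (hypothetical)
transition = {                            # P: next-state probabilities per state
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate(start, steps, rng):
    """Sample a state sequence; each step depends only on the current state."""
    sequence = [start]
    for _ in range(steps):
        current = sequence[-1]
        options = list(transition[current])
        weights = [transition[current][s] for s in options]
        sequence.append(rng.choices(options, weights=weights, k=1)[0])
    return sequence

chain = simulate("sunny", steps=10, rng=random.Random(42))
print(chain)
```

Adding actions and rewards to such a chain gives the Markov Decision Process used to formalize RL problems.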
o Q-Learning:
o Q-learning is an off-policy RL algorithm, which is used for temporal difference learning. Temporal
difference learning methods work by comparing temporally successive predictions.
o It learns the value function Q(s, a), which indicates how good it is to take action "a" in a particular state "s."
o The below flowchart explains the working of Q-learning:
o State Action Reward State Action (SARSA):
o SARSA stands for State Action Reward State Action, which is an on-policy temporal difference learning
method. The on-policy control method selects the action for each state while learning using a specific policy.
o The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and all state-action pairs (s, a).
o The main difference between Q-learning and SARSA is that, unlike Q-learning, SARSA does not require the
maximum reward for the next state when updating the Q-value in the table.
o In SARSA, the new action and reward are selected using the same policy that determined the original action.
o SARSA is so named because it uses the quintuple (s, a, r, s', a'), where:
o s: original state
o a: original action
o r: reward observed while following the states
o s' and a': new state-action pair
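The difference between the two updates can be sketched side by side. This is a minimal illustration, not the notes' own formulation: the learning rate `alpha`, the default discount `gamma`, the dictionary-based Q-table, and the state and action names are all assumptions introduced here.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy update: bootstraps from the action a_next actually chosen in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy update: bootstraps from the maximum Q-value over all actions in s_next."""
    best_next = max(Q[(s_next, b)] for b in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def fresh_table():
    """Hypothetical Q-table with two states and two actions (illustrative values)."""
    return {
        ("s0", "left"): 0.0, ("s0", "right"): 0.0,
        ("s1", "left"): 1.0, ("s1", "right"): 2.0,
    }


if __name__ == "__main__":
    q_sarsa = fresh_table()
    sarsa_update(q_sarsa, "s0", "right", 1.0, "s1", "left")
    print("SARSA:", q_sarsa[("s0", "right")])

    q_ql = fresh_table()
    q_learning_update(q_ql, "s0", "right", 1.0, "s1", ["left", "right"])
    print("Q-learning:", q_ql[("s0", "right")])
```

With the same transition, SARSA uses the value of the action actually taken next ("left"), while Q-learning uses the larger value of "right", so the two updates produce different Q-values.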
o Deep Q Neural Network (DQN):
o As the name suggests, DQN is Q-learning using neural networks.
o For an environment with a large state space, defining and updating a Q-table is a challenging and complex task.
o To solve this issue, we can use a DQN algorithm, in which, instead of defining a Q-table, a neural
network approximates the Q-values for each action and state.
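The idea of replacing the Q-table with a network can be sketched as follows. This is a deliberately small, assumption-laden illustration: it shows only the forward pass of a one-hidden-layer network written in plain Python, with hypothetical layer sizes and random untrained weights. A real DQN also needs a training loop, which is omitted here.

```python
import random


def make_layer(n_in, n_out, rng):
    """Create a dense layer as (weights, biases) with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, z) for z in values]


class TinyQNetwork:
    """Maps a state vector to one estimated Q-value per action (forward pass only)."""

    def __init__(self, state_dim, hidden_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.hidden = make_layer(state_dim, hidden_dim, rng)
        self.output = make_layer(hidden_dim, n_actions, rng)

    def q_values(self, state):
        """Return a list with one Q-value estimate per action."""
        return dense(self.output, relu(dense(self.hidden, state)))

    def greedy_action(self, state):
        """Return the index of the action with the highest estimated Q-value."""
        q = self.q_values(state)
        return max(range(len(q)), key=lambda a: q[a])


if __name__ == "__main__":
    net = TinyQNetwork(state_dim=4, hidden_dim=8, n_actions=3)
    state = [0.1, -0.2, 0.3, 0.0]  # hypothetical state features
    print(net.q_values(state), net.greedy_action(state))
```

The design point is that memory no longer grows with the number of states: any state vector of the right length yields Q-value estimates, including states never seen before.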
Q-Learning Explanation:
o Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
o The main objective of Q-learning is to learn a policy that tells the agent which actions to take under which
circumstances in order to maximize the reward.
o It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
o The goal of the agent in Q-learning is to maximize the value of Q.
o The Q-value can be derived from the Bellman equation.
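One standard way to write the Bellman equation for the value of a state is sketched below. The notation is an assumption chosen to match the components named in these notes: R(s, a) is the reward, γ is the discount factor, P(s, a, s') is the probability of reaching end state s' from state s under action a, and V(s') is the value of that end state.

```latex
V(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s') \Big]
```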
The equation has several components, including the reward, the discount factor (γ), the transition probability,
and the end states s'. No Q-value appears in it yet, so first consider an agent that has three value options,
V(s1), V(s2), and V(s3). Because this is an MDP, the agent cares only about the current state and the future
state. The agent can move in any direction (up, left, or right), so it needs to decide where to go to follow the
optimal path. Here the agent moves on a probability basis and changes state. If we want exact moves instead,
we need to restate the problem in terms of the Q-value.
Q represents the quality of the actions at each state. So instead of using a value at each state, we use a
pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and
the agent takes its next move according to the best Q-value. The Bellman equation can be used for deriving
the Q-value.
On performing any action, the agent gets a reward R(s, a) and ends up in a certain state, so the Q-value
combines the immediate reward with the discounted value of the state reached.
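One common form of the Q-value equation is sketched below, written with assumed notation: P(s, a, s') is the probability of ending up in state s' after taking action a in state s, γ is the discount factor, and V(s') is the value of the state reached. The second line follows if the value of a state is taken to be the Q-value of its best action.

```latex
Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s')
        = R(s, a) + \gamma \sum_{s'} P(s, a, s') \, \max_{a'} Q(s', a')
```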
The Q in Q-learning stands for quality, which means it specifies the quality of an action taken by the agent.
A Q-table (or matrix) is created when performing Q-learning. The table is indexed by state-action pairs, i.e.,
[s, a], and its values are initialized to zero. After each action, the table is updated, and the Q-values are stored
within the table.
The RL agent uses this Q-table as a reference table to select the best action based on the Q-values.
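The Q-table workflow described above (initialize to zero, act, update, then read off the best action) can be sketched on a toy problem. Everything specific here is an assumption for illustration: the five-state corridor environment, the reward of 1 at the goal, the epsilon-greedy exploration, and the values of `alpha`, `gamma`, and `epsilon`.

```python
import random

N_STATES = 5      # hypothetical corridor with states 0..4; state 4 is the goal
ACTIONS = [0, 1]  # 0 = move left, 1 = move right


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor environment."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Run tabular Q-learning and return the learned Q-table."""
    rng = random.Random(seed)
    # Q-table indexed as q_table[state][action], initialized to zero.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit: pick a best action, breaking ties at random.
                best = max(q_table[state])
                action = rng.choice([a for a in ACTIONS if q_table[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning update: bootstrap from the best action in the next state.
            best_next = 0.0 if done else max(q_table[next_state])
            td_error = reward + gamma * best_next - q_table[state][action]
            q_table[state][action] += alpha * td_error
            state = next_state
    return q_table


def greedy_policy(q_table):
    """Read the best action for each non-goal state from the Q-table."""
    return [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    table = train()
    print(greedy_policy(table))
```

After training, the greedy policy read from the table should prefer moving right in every non-goal state, since that is the only way to reach the rewarded goal.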
Reinforcement Learning: The RL algorithm works like the human brain works when making some decisions.
Supervised Learning: Supervised learning works as when a human learns things under the supervision of a guide.
Reinforcement Learning Applications
1. Robotics:
a. RL is used in robot navigation, robo-soccer, walking, juggling, etc.
2. Control:
a. RL can be used for adaptive control, such as factory processes and admission control in telecommunication;
helicopter piloting is another example of reinforcement learning.
3. Game Playing:
a. RL can be used in game playing, such as tic-tac-toe, chess, etc.
4. Chemistry:
a. RL can be used for optimizing chemical reactions.
5. Business:
a. RL is now used for business strategy planning.
6. Manufacturing:
a. In various automobile manufacturing companies, robots use deep reinforcement learning to pick goods and put
them in containers.
7. Finance Sector:
a. RL is currently used in the finance sector for evaluating trading strategies.
From the above discussion, we can say that reinforcement learning is one of the most interesting and useful
parts of machine learning. In RL, the agent learns by exploring the environment without any human
intervention. It is one of the main learning approaches used in artificial intelligence. But there are cases
where it should not be used: for example, if you have enough data to solve the problem, other ML algorithms
can be used more efficiently. The main issue with RL algorithms is that some factors, such as delayed
feedback, may affect the speed of learning.
Code No: R20A0513
(Autonomous Institution – UGC, Govt. of India)
III B.Tech I Semester Regular Examinations, December 2022
Artificial Intelligence
Code No: R17A1204
(Autonomous Institution – UGC, Govt. of India)
IV B.Tech I Semester Supplementary Examinations, November 2022
Artificial Intelligence
B Identify the type of control strategy used in the 8-puzzle problem. Explain. [7M]
3 A Justify the need for minimax algorithm. Explicate the steps of minimax [7M]
B Explain non-monotonic reasoning and discuss the various logics associated [7M]
with it.
4 A Define the syntactic elements of first-Order logic [7M]
Code No: R18A1205
(Autonomous Institution – UGC, Govt. of India)
III B.Tech I Semester Supplementary Examinations, June 2022
Artificial Intelligence
2 What is best-first search? Explain the A* algorithm in detail. Discuss the BFS algorithm. [14M]
4 What is First Order Logic? State and prove Bayes' Theorem and mention its [14M]
5 Give a detailed note on a generic knowledge-based agent. In the wumpus world, the agent [14M]
will have five sensors. Mention various other knowledge representation schemes
6 Prove the following assertion: for every game tree, the utility obtained by MAX using [14M]
minimax decisions against a suboptimal MIN will never be lower than the utility
obtained playing against an optimal MIN. Can you come up with a game tree in which
MAX can do still better using a suboptimal strategy against a suboptimal MIN?
7 Discuss in detail about Winston’s Learning Program with its implementation [14M]
Code No: R17A1204
(Autonomous Institution – UGC, Govt. of India)
IV B.Tech I Semester Supplementary Examinations, June 2022
Artificial Intelligence
2 a. Define Agent Program. Explain the Following Agent Programs with [5M]
Respect to Intelligent Systems
i. Goal-Based Reflex Agent
ii. Utility-Based Agent
b. Explain the following Heuristic Search Strategies with Suitable Examples: [9M]
i. Generic Best-First Algorithm
ii. A * Algorithm
5 a. Explain Bayes' Rule and its applications in Artificial Intelligence. [7M]
b. Differentiate between monotonic and non-monotonic reasoning. [7M]
Code No: R18A0526
(Autonomous Institution – UGC, Govt. of India)
IV B.Tech I Semester Regular/Supplementary Examinations, November 2022
Machine Learning
(CSE & IT)
Code No: R20A6601
(Autonomous Institution – UGC, Govt. of India)
III B.Tech I Semester Regular Examinations, December 2022
Machine Learning
What is
the probability that Tom is not sleeping although it is raining heavily and he is
not well? He was upset that he could not join for the sleep over in his friend’s
home and his mom forced him to stay at home.
6 A Describe the naive Bayes theorem in text classification [7M]
B What are the limitations of the Bayes optimal classifier? Explain how the [7M]
Gibbs algorithm tries to resolve these issues.
7 A Explain the steps of the BACKPROPAGATION algorithm for feedforward [7M]
networks that contain two layers of sigmoid units.
B Consider the Artificial Neural Network with the following values [7M]
x1=1, x2=0, w11=0.25, w12=0.10, w21=0.15, w22=0.1, w4=0.3, w5=0.4, bias b1 to h1
and h2=1, and bias b2 to O1=1. Assume the actual output=0.95. Find the
predicted output. Find the error and, through backpropagation, find the updated
weights w4 and w5, given learning rate=0.3.
Input values x1, and x2, randomly assigned weights are w1, w2, w3, w4, w5,
w6, w7 and w8. Target values o1 = 0.05 and o2 = 0.95. Bias values b1 and b2.
Use the sigmoid activation function. Learning rate α = 0.5.
8 Consider the following neural network with the input, output and weight [14M]
parameters values shown in the diagram. The activation values in each neuron
is calculated using the sigmoid activation function. Now, answer the
a. For the given input i1 and i2 as shown in the diagram, compute
the output of the hidden layer and output layer neurons.
b. Compute the error in the network with the initialized weight
parameters shown in the diagram.
c. Update the weight parameters for w7 and w8 using
backpropagation algorithm in the first iteration. Consider
learning rate as 0.01.
9 A Illustrate how to find the state sequence with any example. [4M]
3 a) What are the benefits of pruning in decision tree induction? Explain different [8M]
approaches to tree pruning?
b) Explain the concept of a Perceptron with a neat diagram. [6M]
4 a) Explain how Support Vector Machine can be used for classification of linearly [8M]
separable data.
b) Give a detailed note on kernel functions. [6M]
5 Describe boosting and the AdaBoost algorithm with a neat sketch. [14M]
7 Summarize about the Q-learning model and explain with diagram [14M]
Code No: R18A0526
(Autonomous Institution – UGC, Govt. of India)
IV B.Tech I Semester Supplementary Examinations, June 2022
Machine Learning
(CSE & IT)
3 What is the difference between logistic regression and linear regression give an [14M]
6 Explain K-Nearest Neighbor(KNN) Algorithm with an example and list the [14M]
advantages and disadvantages of K-NN
7 a) What is the goal of the support vector machine (SVM)? How to compute the [7M]
b) What are the elements of reinforcement learning? [7M]
8 Explain genetic programming with an example. What are the operators of a [14M]
genetic algorithm?
Code No: R17A0534
(Autonomous Institution – UGC, Govt. of India)
IV B.Tech- II Semester Supplementary Examinations, May 2022
Machine Learning
4 a) Explain how Support Vector Machine can be used for classification of linearly [7M]
separable data.
b) Elucidate K-means algorithm with neat diagram. [7M]
5 a) Describe the random forest algorithm to improve classifier accuracy [7M]
b) Explain the concept of Bagging with its uses? [7M]
6 a) How can data be classified using the KNN algorithm? Explain with a neat sketch. [7M]
b) Discuss the various distance measure algorithms. [7M]
7 a) Write about the learning Rule sets. [7M]
b) Write some common evaluation functions in the learning rule sets. [7M]
8 Explain normal and Binomial Distributions with an example. [14M]
9 a) Discuss about the mutation operator. [7M]
b) Examine how genetic algorithm searches large space of candidate objects with [7M]
an example with fitness function
10 Assess the parallelization of genetic algorithms with an example. [14M]