aiml

The document outlines the course objectives and structure for CS3491 Artificial Intelligence and Machine Learning, covering topics such as search techniques, probabilistic reasoning, supervised and unsupervised learning, and neural networks. It includes practical exercises and expected course outcomes, emphasizing the application of various algorithms and models. Additionally, it lists recommended textbooks and references for further study in the field.

Uploaded by

Abi .J

www.LearnEngineering.in

CS3491 ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

COURSE OBJECTIVES:
The main objectives of this course are to:
• Study uninformed and heuristic search techniques.
• Learn techniques for reasoning under uncertainty.
• Introduce machine learning and supervised learning algorithms.
• Study ensembling and unsupervised learning algorithms.
• Learn the basics of deep learning using neural networks.

UNIT I PROBLEM SOLVING

Introduction to AI - AI Applications - Problem solving agents – search algorithms –
uninformed search strategies – Heuristic search strategies – Local search and optimization
problems – adversarial search – constraint satisfaction problems (CSP)

UNIT II PROBABILISTIC REASONING
Acting under uncertainty – Bayesian inference – naïve Bayes models. Probabilistic
reasoning – Bayesian networks – exact inference in BN – approximate inference in BN –
causal networks.

UNIT III SUPERVISED LEARNING
Introduction to machine learning – Linear Regression Models: Least squares, single
& multiple variables, Bayesian linear regression, gradient descent, Linear Classification
Models: Discriminant function – Probabilistic discriminative model – Logistic regression,
Probabilistic generative model – Naive Bayes, Maximum margin classifier – Support vector
machine, Decision Tree, Random forests

UNIT IV ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING


Combining multiple learners: Model combination schemes, Voting, Ensemble
Learning – bagging, boosting, stacking, Unsupervised learning: K-means, Instance Based
Learning: KNN, Gaussian mixture models and Expectation maximization

UNIT V NEURAL NETWORKS


Perceptron – Multilayer perceptron, activation functions, network training – gradient
descent optimization – stochastic gradient descent, error backpropagation, from shallow
networks to deep networks – Unit saturation (aka the vanishing gradient problem) – ReLU,
hyperparameter tuning, batch normalization, regularization, dropout.

PRACTICAL EXERCISES:
1. Implementation of Uninformed search algorithms (BFS, DFS)
2. Implementation of Informed search algorithms (A*, memory-bounded A*)
3. Implement naïve Bayes models
4. Implement Bayesian Networks
5. Build Regression models
6. Build decision trees and random forests
7. Build SVM models
8. Implement ensembling techniques
9. Implement clustering algorithms
10. Implement EM for Bayesian networks
11. Build simple NN models
12. Build deep learning NN models

COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Use appropriate search algorithms for problem solving
CO2: Apply reasoning under uncertainty
CO3: Build supervised learning models
CO4: Build ensembling and unsupervised models
CO5: Build deep learning neural network models

TEXT BOOKS:
1. Stuart Russell and Peter Norvig, “Artificial Intelligence – A Modern Approach”, Fourth
Edition, Pearson Education, 2021.
2. Ethem Alpaydin, “Introduction to Machine Learning”, MIT Press, Fourth Edition, 2020.

REFERENCES:
1. Dan W. Patterson, “Introduction to Artificial Intelligence and Expert Systems”, Pearson
Education, 2007.
2. Kevin Knight, Elaine Rich, and Nair B., “Artificial Intelligence”, McGraw Hill, 2008.
3. Patrick H. Winston, “Artificial Intelligence”, Third Edition, Pearson Education, 2006.
4. Deepak Khemani, “Artificial Intelligence”, Tata McGraw Hill Education, 2013
(http://nptel.ac.in/).
5. Christopher M. Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006.
6. Tom Mitchell, “Machine Learning”, McGraw Hill, 3rd Edition, 1997.
7. Charu C. Aggarwal, “Data Classification Algorithms and Applications”, CRC Press, 2014.
8. Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar, “Foundations of Machine
Learning”, MIT Press, 2012.
9. Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.

UNIT I PROBLEM SOLVING


Introduction to AI - AI Applications - Problem solving agents – search algorithms –
uninformed search strategies – Heuristic search strategies – Local search and optimization
problems – adversarial search – constraint satisfaction problems (CSP)

Part A

1. What is Artificial Intelligence?


Artificial Intelligence is the study of how to make computers do things which
at the moment people do better.

2. What is an agent?

An agent is anything that can be viewed as perceiving its environment through
sensors and acting upon that environment through actuators.

3. What are the different types of agents?
A human agent has eyes, ears, and other organs for sensors and hands, legs,
mouth, and other body parts for actuators.
A robotic agent might have cameras and infrared range finders for sensors
and various motors for actuators.
A software agent receives keystrokes, file contents, and network packets as
sensory inputs and acts on the environment by displaying on the screen, writing files,
and sending network packets.
A generic agent is the general structure of any agent that interacts with its
environment.

4. Define rational agent.

For each possible percept sequence, a rational agent should select an action that is
expected to maximize its performance measure, given the evidence provided by the
percept sequence and whatever built-in knowledge the agent has. A rational agent
should be autonomous.

5. List down the characteristics of intelligent agent.

Internal characteristics are:
Learning/reasoning: an agent has the ability to learn from previous
experience and to successively adapt its own behaviour to the environment.
Reactivity: an agent must be capable of reacting appropriately to influences or
information from its environment.
Autonomy: an agent must have control over both its actions and its internal
states. The degree of the agent’s autonomy can be specified. Intervention from
the user may be needed only for important decisions.
Goal-oriented: an agent has well-defined goals and gradually influences its
environment so as to achieve them.
External characteristics are:
Communication: an agent often requires interaction with its environment
to fulfil its tasks, such as with humans, other agents, and arbitrary information sources.
Cooperation: cooperation of several agents permits faster and better solutions
for complex tasks that exceed the capabilities of a single agent.
Mobility: an agent may navigate within electronic communication networks.
Character: like a human, an agent may demonstrate external behaviour with
as many human characteristics as possible.

6. What are various applications of AI? or What can AI do today?

• Robotic vehicles
• Speech recognition
• Autonomous planning and scheduling
• Game playing
• Spam fighting
• Logistics planning
• Robotics
• Machine translation

7. Are reflex actions (such as flinching from a hot stove) rational? Are they
intelligent?

Reflex actions can be considered rational. If the body is performing the action,
then it can be argued that reflex actions are rational because of evolutionary
adaptation. Flinching from a hot stove is a normal reaction, because the body wants to
keep itself out of danger and getting away from something hot is a way to do that.
Whether reflex actions are also intelligent is less clear. Intelligence suggests that
there is reasoning and logic involved in the action itself, and a reflex is carried out
without any such deliberation.

8. Is AI a science, or is it engineering? Or neither or both? Explain.

AI is both science and engineering. Observing and experimenting, which are
at the core of any science, allow us to study artificial intelligence. From what we
learn by observation and experimentation, we are able to engineer new systems that
encompass what we learn and that may even be capable of learning themselves.

9. What are the various agent programs in intelligent systems?


Simple reflex agents
Model-based reflex agents
Goal-based agents
Utility-based agents


10. Define the problem solving agent.


A problem-solving agent is a goal-based agent. It decides what to do by
finding a sequence of actions that leads to desirable states. The agent can adopt a goal
and aim at satisfying it. Goal formulation is the first step in problem solving.

11. Define the terms goal formulation and problem formulation.


Goal formulation based on the current situation and the agent’s performance
measure is the first step in problem solving. The agent’s task is to find out which
sequence of actions will get to a goal state.
Problem formulation is the process of deciding what actions and states to
consider given a goal.

12. List the steps involved in simple problem solving agent.
(i) Goal formulation
(ii) Problem formulation
(iii) Search
(iv) Search Algorithm
(v) Execution phase

13. What are the components of well-defined problems? (or)
What are the four components to define a problem? Define them.
The four components that define a problem are:
1) Initial state – the state in which the agent starts.
2) A description of possible actions – the actions that are available to the agent.
3) The goal test – the test that determines whether a given state is a goal
(final) state.
4) A path cost function – the function that assigns a numeric cost (value)
to each path.
The problem-solving agent is expected to choose a cost function that reflects
its own performance measure.

14. Differentiate toy problems and real world problems?


A toy problem is intended to illustrate various problem solving methods. It can
be easily used by different researchers to compare the performance of algorithms. A
real world problem is one whose solutions people actually care about.

15. Give examples of real world and toy problems.

Real world problem examples:
• Airline travel problem
• Touring problem
• Traveling salesman problem
• VLSI layout problem
• Robot navigation
• Automatic assembly
• Internet searching
Toy problem examples:
• 8-Queens problem
• 8-Puzzle problem
• Vacuum world problem

16. How will you measure the problem-solving performance?


We can evaluate an algorithm’s performance in four ways:
Completeness: Is the algorithm guaranteed to find a solution when there is one?
Optimality: Does the strategy find the optimal solution?
Time complexity: How long does it take to find a solution?
Space complexity: How much memory is needed to perform the search?

17. What is the application of BFS?
BFS is a simple search strategy that is complete, i.e. it is sure to find a solution if
one exists. If the depth of the search tree is small, BFS is the best choice. It is
useful in tree search as well as in graph search.

18. State on which basis search algorithms are chosen?
Search algorithms are chosen depending on two components.
1) The structure of the state space – that is, whether the state space is a tree or a
graph. The critical factors for the state space are the branching factor and the depth of
that tree or graph.
2) The performance of the search strategy – a complete, optimal search
strategy with better time and space requirements is the critical factor in the
performance of a search strategy.

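19. Evaluate performance of problem-solving method based on depth-first search
algorithm?
DFS performance is measured in four ways, where b is the branching factor and m is
the maximum depth of the search tree:
1) Completeness – It is not complete in general: it can descend forever in an
infinite-depth space or loop on repeated states. It is complete in finite state spaces
when repeated states are avoided.
2) Optimality – It is not optimal.
3) Time complexity – Its time complexity is O(b^m).
4) Space complexity – Its space complexity is O(bm).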

20. List some of the uninformed search techniques.


The uninformed search strategies are those that do not take into account the
location of the goal. That is, these algorithms ignore where they are going until they
find a goal and report success. The various uninformed search strategies are:
• Breadth-first search
• Uniform-cost search
• Depth-first search
• Depth-limited search
• Iterative deepening depth-first search
• Bidirectional search
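
The first and third strategies in the list above can be sketched as follows. This is a minimal illustration, not a reference implementation: the graph `GRAPH` and its node names are hypothetical, and both functions track visited nodes so they terminate on graphs with cycles.

```python
from collections import deque

# Hypothetical toy graph (adjacency lists); not taken from the course material.
GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}


def bfs(graph, start, goal):
    """Breadth-first search: expands the shallowest unexpanded node first."""
    frontier = deque([[start]])  # FIFO queue of paths
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None


def dfs(graph, start, goal):
    """Depth-first search: expands the deepest unexpanded node first."""
    frontier = [[start]]  # LIFO stack of paths
    visited = set()
    while frontier:
        path = frontier.pop()
        node = path[-1]
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        # reversed() so that neighbours are tried in their listed order
        for nxt in reversed(graph[node]):
            if nxt not in visited:
                frontier.append(path + [nxt])
    return None
```

The only structural difference between the two functions is the frontier: a queue gives breadth-first order and a stack gives depth-first order.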

21. What is the power of heuristic search? (or) Why does one go for heuristic
search?
Heuristic search uses problem-specific knowledge while searching the state
space. This helps to improve average search performance. It uses evaluation
functions that denote the relative desirability (goodness) of expanding a node. This
makes the search more efficient and faster. One should go for heuristic search because
it has the power to solve large, hard problems in affordable time.

22. What are the advantages of heuristic function?
A heuristic function ranks alternative paths in various search algorithms, at
each branching step, based on the available information, so that a better path is
chosen. The main advantage of a heuristic function is that it guides which state to
explore next while searching. It makes use of problem-specific knowledge, like
constraints, to check the goodness of a state to be explored. This drastically reduces
the required searching time.

23. State the reason when hill climbing often gets stuck?
A local maximum is a state where the hill climbing algorithm is sure to get stuck.
A local maximum is a peak that is higher than each of its neighbour states, but lower
than the global maximum. So the better state is missed here. All the search
effort turns out to be wasted. It is like a dead end.

24. When a heuristic function h is said to be admissible? Give an admissible
heuristic function for TSP?
A heuristic function is admissible if it never overestimates the cost to reach the
goal state. That is, for every node n, h(n) is no greater than the true cost of the
cheapest path from n to the goal. Admissible heuristics for TSP are:
a. Minimum spanning tree
b. Minimum assignment problem

25. What do you mean by local maxima with respect to search technique?
A local maximum is a peak that is higher than each of its neighbour states, but
lower than the global maximum, i.e. a local maximum is a small hill on the surface
whose peak is not as high as the main peak (which is the optimal solution). Hill
climbing fails to find the optimum solution when it encounters a local maximum. Any small
move from here makes things worse (temporarily). At a local maximum all the
search effort turns out to be wasted. It is like a dead end.
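
The behaviour described above can be reproduced with a small sketch. It assumes a one-dimensional landscape with at least two positions, where the neighbours of a position are the positions directly to its left and right; the list `LANDSCAPE` is a made-up example, not data from the course.

```python
def hill_climb(values, start):
    """Steepest-ascent hill climbing over a 1-D list of heights.

    Moves to the higher neighbouring index until no neighbour is higher.
    """
    current = start
    while True:
        neighbours = [i for i in (current - 1, current + 1) if 0 <= i < len(values)]
        best = max(neighbours, key=lambda i: values[i])
        if values[best] <= values[current]:
            return current  # no uphill move left: a (possibly local) maximum
        current = best


# Hypothetical landscape: local maximum at index 2 (height 5),
# global maximum at index 7 (height 9).
LANDSCAPE = [1, 3, 5, 4, 2, 6, 8, 9, 7]

stuck_at = hill_climb(LANDSCAPE, start=0)    # climbs 1 -> 3 -> 5 and stops
reaches = hill_climb(LANDSCAPE, start=4)     # climbs 2 -> 6 -> 8 -> 9
```

Starting from index 0 the search stops on the small hill, while starting from index 4 it reaches the main peak, which is why random restarts are a common remedy.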

26. How can we avoid ridge and plateau in hill climbing?
Ridges and plateaus in hill climbing can be avoided using methods like
backtracking and making big jumps. Backtracking and making big jumps help to avoid
plateaus, whereas applying multiple rules before testing helps to avoid the problem of ridges.

27. Differentiate Blind Search and Heuristic Search.

Parameter | Blind search | Heuristic search
----------|--------------|-----------------
Known as | It is also known as uninformed search. | It is also known as informed search.
Using knowledge | It doesn’t use knowledge for the searching process. | It uses knowledge for the searching process.
Performance | It finds a solution slowly as compared to an informed search. | It finds a solution more quickly.
Completion | It is generally complete (for example BFS), although depth-first search can fail in infinite spaces. | It may or may not be complete.
Cost factor | Cost is high. | Cost is low.
Time | It consumes moderate time because of slow searching. | It consumes less time because of quick searching.
Direction | No suggestion is given regarding the solution. | There is a direction given about the solution.
Implementation | It is lengthier to implement. | It is less lengthy to implement.
Efficiency | It is comparatively less efficient, as the incurred cost is more and the speed of finding the solution is slow. | It is more efficient, as efficiency takes into account cost and performance. The incurred cost is less and the speed of finding solutions is quick.
Computational requirements | Comparatively higher computational requirements. | Computational requirements are lessened.
Size of search problems | Solving a massive search task is challenging. | It has a wide scope in terms of handling large search problems.
Examples of algorithms | a) Breadth first search b) Uniform cost search c) Depth first search d) Depth limited search e) Iterative deepening search f) Bi-directional search | a) Best first search b) Greedy search c) A* search d) AO* search e) Hill climbing algorithm

28. What is CSP?


CSPs (constraint satisfaction problems) are problems whose states and goal test conform
to a standard, structured and very simple representation. A CSP is defined using a set of
variables and a set of constraints on those variables. Each variable takes its values
from a specified domain. For example – Graph coloring problem.
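
As an illustration of the graph coloring example, the following is a minimal backtracking sketch. The region names, the adjacency map `NEIGHBOURS` and the three colours are assumptions chosen for the example; plain chronological backtracking is used here, without the ordering heuristics or constraint propagation that practical CSP solvers usually add.

```python
def consistent(var, value, assignment, neighbours):
    """A value is allowed if no already-assigned neighbour has the same colour."""
    return all(assignment.get(n) != value for n in neighbours[var])


def backtrack(assignment, variables, domains, neighbours):
    """Plain backtracking search; returns a complete assignment or None."""
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        if consistent(var, value, assignment, neighbours):
            assignment[var] = value
            result = backtrack(assignment, variables, domains, neighbours)
            if result is not None:
                return result
            del assignment[var]  # undo and try the next value
    return None


# Hypothetical instance: four mutually bordering regions, three colours.
VARIABLES = ["WA", "NT", "SA", "Q"]
DOMAINS = {v: ["red", "green", "blue"] for v in VARIABLES}
NEIGHBOURS = {
    "WA": ["NT", "SA"],
    "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q"],
    "Q": ["NT", "SA"],
}

solution = backtrack({}, VARIABLES, DOMAINS, NEIGHBOURS)
```

With only two colours the same instance has no solution, because WA, NT and SA border each other pairwise, and `backtrack` then returns `None`.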

29. How can minimax also be extended for game of chance?

In a game of chance, we can add an extra level of chance nodes to the game search tree.
The successors of a chance node are the possible outcomes of the random element. Each
outcome di has a probability P(di) attached to it, and the value of a chance node is the
probability-weighted average of the values of its successors. The successor function
S(N, di) gives the moves from position N for outcome di.
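
The extension described above is commonly called expectiminimax. The sketch below assumes a hypothetical tuple-based tree encoding (`("max", children)`, `("min", children)`, `("chance", [(probability, child), ...])`, with plain numbers as leaf utilities); the tree `TREE` is made up for illustration.

```python
def expectiminimax(node):
    """Value of a game-tree node with MAX, MIN and CHANCE levels.

    A node is either a number (leaf utility) or a tuple (kind, children).
    For "chance" nodes, children is a list of (probability, child) pairs.
    """
    if isinstance(node, (int, float)):
        return node
    kind, children = node
    if kind == "max":
        return max(expectiminimax(c) for c in children)
    if kind == "min":
        return min(expectiminimax(c) for c in children)
    if kind == "chance":
        # Probability-weighted average over the random outcomes.
        return sum(p * expectiminimax(c) for p, c in children)
    raise ValueError(f"unknown node kind: {kind}")


# Hypothetical tree: MAX chooses between two moves, each followed by a fair coin flip.
TREE = ("max", [
    ("chance", [(0.5, ("min", [2, 4])), (0.5, ("min", [6, 8]))]),
    ("chance", [(0.5, ("min", [0, 10])), (0.5, ("min", [5, 7]))]),
])

value = expectiminimax(TREE)
```

Here the first move is worth 0.5 × 2 + 0.5 × 6 = 4 and the second 0.5 × 0 + 0.5 × 5 = 2.5, so MAX prefers the first move.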

Part B
1. Enumerate the classical “Water Jug Problem”. Describe the state space for this
problem and also give the solution.
2. How to define a problem as state space search? Discuss it with the help of an
example.
3. Solve the given problem. Describe the operators involved in it.
Consider a Water Jug Problem: You are given two jugs, a 4-gallon one and
a 3-gallon one. Neither has any measuring markers on it. There is a pump
that can be used to fill the jugs with water. How can you get exactly 2 gallons
of water into the 4-gallon jug? Explicit Assumptions: A jug can be filled
from the pump, water can be poured out of a jug onto the ground, water can
be poured from one jug to another and there are no other measuring
devices available.
4. Define the following problems. What type of control strategy is used in the
following problems?
i. The Tower of Hanoi
ii. Crypto-arithmetic
iii. The Missionaries and Cannibals problem
iv. 8-puzzle problem
5. Discuss uninformed search methods with examples.
6. Give an example of a problem for which breadth first search would work
better than depth first search.
7. Explain the algorithm for steepest hill climbing.
8. Explain the A* search and give the proof of optimality of A*.
9. Explain AO* algorithm with a suitable example. State the limitations of the
algorithm.
10. Explain the nature of heuristics with example. What is the effect of heuristic
accuracy?
11. Explain the various types of hill climbing search techniques.
12. Discuss constraint satisfaction problems with an algorithm for solving a
cryptarithmetic problem.
13. Solve the following cryptarithmetic problem using the constraint satisfaction
search procedure.
    CROSS
  + ROADS
  -------
   DANGER

14. Explain the alpha-beta pruning algorithm and the Minimax game playing
algorithm with example.
15. Solve the given problem. Describe the operators involved in it.
Consider a water jug problem: You are given two jugs, a 4-gallon one and a
3-gallon one. Neither has any measuring markers on it. There is a pump
that can be used to fill the jugs with water. How can you get exactly 2 gallons
of water into the 4-gallon jug? Explicit Assumptions: A jug can be filled from
the pump, water can be poured out of a jug onto the ground, water can be
poured from one jug to another and there are no other measuring
devices available.
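
One possible way to approach the water jug question is as a state space search. The sketch below is only an illustration under stated assumptions: a state is a pair (a, b) of gallons in the 4-gallon and 3-gallon jugs, the operators are the fill, empty and pour actions named in the question, and breadth-first search is used so that the first solution found uses the fewest operations. The function name `water_jug` and its parameters are hypothetical.

```python
from collections import deque


def water_jug(cap_a=4, cap_b=3, target=2):
    """Breadth-first search over states (a, b) = gallons in each jug.

    Returns the shortest list of states from (0, 0) to a state with
    `target` gallons in the first jug, or None if no such state is reachable.
    """
    start = (0, 0)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        a, b = state
        if a == target:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        pour_ab = min(a, cap_b - b)  # amount moved when pouring A into B
        pour_ba = min(b, cap_a - a)  # amount moved when pouring B into A
        successors = [
            (cap_a, b), (a, cap_b),        # fill a jug from the pump
            (0, b), (a, 0),                # empty a jug onto the ground
            (a - pour_ab, b + pour_ab),    # pour A into B
            (a + pour_ba, b - pour_ba),    # pour B into A
        ]
        for nxt in successors:
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


path = water_jug()
```

For the 4-gallon and 3-gallon jugs the returned path contains seven states, that is, six operations. If both capacities and the target make the goal unreachable (for example capacities 4 and 2 with target 3), the function returns `None`.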

UNIT II PROBABILISTIC REASONING

Acting under uncertainty – Bayesian inference – naïve Bayes models. Probabilistic
reasoning – Bayesian networks – exact inference in BN – approximate inference in BN –
causal networks.

Part A
1. Why does uncertainty arise?
• Agents almost never have access to the whole truth about their environment.
• Uncertainty arises because of both laziness and ignorance. It is inescapable in
complex, nondeterministic, or partially observable environments.
• Agents cannot find a categorical answer.
• Uncertainty can also arise because of incompleteness or incorrectness in the agent’s
understanding of the properties of the environment.

2. Differentiate uncertainty with ignorance.

A key condition that differentiates ignorance from uncertainty is the absence
of knowledge about the factors that influence the issues.

3. What is the need for probability theory in uncertainty?

Probability provides a way of summarizing the uncertainty that comes from
our laziness and ignorance. Probability statements do not have quite the same kind
of semantics as logical statements: they are made with respect to the agent’s state of
knowledge, that is, the evidence it has.

4. What is the need for utility theory in uncertainty?

Utility theory says that every state has a degree of usefulness, or utility, to an
agent, and that the agent will prefer states with higher utility.

5. Define principle of maximum expected utility (MEU)?


The fundamental idea of decision theory is that an agent is rational if and only if
it chooses the action that yields the highest expected utility, averaged over all the
possible outcomes of the action. This is called the principle of maximum expected
utility (MEU).

6. Mention the needs of probabilistic reasoning in AI.

• When there are unpredictable outcomes.
• When the specifications or possibilities of predicates become too large to
handle.
• When an unknown error occurs during an experiment.

7. What does the full joint probability distribution specify?


The full joint probability distribution specifies the probability of each
complete assignment of values to random variables. It is usually too large to create
or use in its explicit form, but when it is available it can be used to answer queries
simply by adding up entries for the possible worlds corresponding to the query
propositions.

8. State Bayes' Theorem in Artificial Intelligence.


Bayes' theorem, also known as Bayes' rule, Bayes' law, or Bayesian
reasoning, determines the probability of an event with uncertain
knowledge. It is a way to calculate the value of P(A|B) with the knowledge of
P(B|A). Bayes' theorem allows updating the probability prediction of an event by
observing new information from the real world.
Example: If cancer corresponds to one's age then by using Bayes' theorem,
we can determine the probability of cancer more accurately with the help of age.

P(A|B) = [P(A) * P(B|A)] / P(B)

9. Given that P(A) = 0.3, P(A|B) = 0.4 and P(B) = 0.5, compute P(B|A).

By Bayes' theorem, P(A|B) = [P(A) * P(B|A)] / P(B), so

0.4 = (0.3 * P(B|A)) / 0.5

P(B|A) = (0.4 * 0.5) / 0.3 ≈ 0.67
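
The same arithmetic can be checked with a few lines of code. This is only a sketch; the function name `bayes_posterior` and its parameter names are made up for the illustration.

```python
def bayes_posterior(p_a, p_a_given_b, p_b):
    """Return P(B|A) from P(A), P(A|B) and P(B) using Bayes' theorem."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive")
    return p_a_given_b * p_b / p_a


# Values from the question: P(A) = 0.3, P(A|B) = 0.4, P(B) = 0.5.
p_b_given_a = bayes_posterior(p_a=0.3, p_a_given_b=0.4, p_b=0.5)
print(round(p_b_given_a, 4))
```

The exact value is 2/3, so the result rounds to 0.67 at two decimal places.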

10. What is Bayesian Belief Network?


A Bayesian network is a probabilistic graphical model which represents a set
of variables and their conditional dependencies using a directed acyclic graph. It is
also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a
probability distribution, and also use probability theory for prediction and
anomaly detection.
A Bayesian network is a directed graph in which each node is annotated with
quantitative probability information. The full specification is as follows:
1. Each node corresponds to a random variable, which may be discrete or
continuous.
2. A set of directed links or arrows connects pairs of nodes. If there is an arrow
from node X to node Y, X is said to be a parent of Y. The graph has no directed
cycles (and hence is a directed acyclic graph, or DAG).
3. Each node Xi has a conditional probability distribution P(Xi | Parents(Xi))
that quantifies the effect of the parents on the node.

Part B

1. How to get the exact inference from Bayesian network?
2. Explain the variable elimination algorithm for answering queries on Bayesian
networks.
3. Define uncertain knowledge, prior probability and conditional probability.
State the Bayes’ theorem. How is it useful for decision making under
uncertainty? Explain belief networks briefly.
4. Explain the method of handling approximate inference in Bayesian networks.
5. What is Bayes’ rule? Explain how Bayes’ rule can be applied to tackle
uncertain knowledge.
6. Discuss Bayesian theory and Bayesian networks.
7. Explain how Bayesian statistics provides reasoning under various kinds
of uncertainty.
8. How to get the approximate inference from Bayesian network?
9. Construct a Bayesian Network and define the necessary CPTs for the given
scenario. We have a bag of three biased coins a, b and c with probabilities of
coming up heads of 20%, 60% and 80% respectively. One coin is drawn
randomly from the bag (with equal likelihood of drawing each of the three
coins) and then the coin is flipped three times to generate the outcomes X1,
X2 and X3.
a. Draw a Bayesian network corresponding to this setup and define the
relevant CPTs.
b. Calculate which coin is most likely to have been drawn if the flips come
up HHT.
10. Consider the following set of propositions:
Patient has spots.
Patient has measles.
Patient has high fever.
Patient has Rocky Mountain spotted fever.
Patient has previously been inoculated against measles.
Patient was recently bitten by a tick.
Patient has an allergy.
a) Create a network that defines the causal connections among these nodes.
b) Make it a Bayesian network by constructing the necessary conditional
probability matrix.
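
For the three-coin scenario in question 9 above, the posterior over coins can be computed directly once the network is read as Coin → X1, Coin → X2, Coin → X3, with the flips independent given the coin. The sketch below assumes exactly that structure; the function and variable names are made up for the illustration.

```python
def coin_posterior(flips, heads_prob, prior):
    """Posterior P(coin | flips) for flips that are independent given the coin."""
    unnormalised = {}
    for coin, p_heads in heads_prob.items():
        likelihood = 1.0
        for flip in flips:
            likelihood *= p_heads if flip == "H" else (1.0 - p_heads)
        unnormalised[coin] = prior[coin] * likelihood
    total = sum(unnormalised.values())
    return {coin: value / total for coin, value in unnormalised.items()}


# CPT entries P(heads | coin) and the uniform prior stated in the question.
HEADS_PROB = {"a": 0.2, "b": 0.6, "c": 0.8}
PRIOR = {coin: 1 / 3 for coin in HEADS_PROB}

posterior = coin_posterior("HHT", HEADS_PROB, PRIOR)
most_likely = max(posterior, key=posterior.get)
```

The likelihoods of HHT are 0.2 × 0.2 × 0.8 = 0.032 for coin a, 0.6 × 0.6 × 0.4 = 0.144 for coin b and 0.8 × 0.8 × 0.2 = 0.128 for coin c, so with a uniform prior coin b is the most likely.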

UNIT III SUPERVISED LEARNING


Introduction to machine learning – Linear Regression Models: Least squares, single
& multiple variables, Bayesian linear regression, gradient descent, Linear Classification
Models: Discriminant function – Probabilistic discriminative model - Logistic regression,
Probabilistic generative model – Naive Bayes, Maximum margin classifier – Support vector
machine, Decision Tree, Random forests.

PART - A

1. What is Machine Learning?


Machine learning is a branch of computer science which deals with programming
systems so that they automatically learn and improve with experience. For
example: robots are programmed so that they can perform a task based on the data
they gather from sensors. Machine learning automatically learns programs from data.

2. Mention the difference between Data Mining and Machine learning?
Machine learning relates to the study, design and development of
algorithms that give computers the capability to learn without being explicitly
programmed. Data mining, in contrast, can be defined as the process of extracting
knowledge or previously unknown interesting patterns from unstructured data.

3. What is ‘Overfitting’ in Machine learning?

In machine learning, ‘overfitting’ occurs when a statistical model describes random
error or noise instead of the underlying relationship. Overfitting is normally observed
when a model is excessively complex, because it has too many parameters with respect
to the amount of training data. A model that has been overfit exhibits poor
predictive performance.

4. Why overfitting happens?

The possibility of overfitting exists because the criteria used for training the model
are not the same as the criteria used to judge the efficacy of a model.

5. How can you avoid overfitting?

Overfitting can be avoided by using a lot of data; it tends to happen when you
try to learn from a small dataset. If you have only a small dataset and are still
forced to build a model from it, you can use a technique known as cross validation.
In this method the dataset is split into two sections, a training dataset and a testing
dataset: the training dataset is used to fit the model, and the testing dataset is used
only to test it. In other words, the model is trained on a dataset of known data
(the training data set) and tested against a dataset of data it has not seen.
The idea of cross validation is to define a dataset to
“test” the model in the training phase.
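
A common form of this idea is k-fold cross validation, where the split into training and testing parts is repeated k times so that every sample is used for testing exactly once. The sketch below only produces the index splits and assumes the data are already shuffled; the function name `k_fold_indices` is hypothetical and no particular library API is implied.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
```

A model would then be trained on each `train` part and scored on the matching `test` part, and the k scores averaged.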

6. What are the five popular algorithms of Machine Learning?

• Decision Trees
• Neural Networks (back propagation)
• Probabilistic networks
• Nearest Neighbor
• Support vector machines

7. What are the different Algorithm techniques in Machine Learning?

The different types of techniques in Machine Learning are:
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning
• Transduction
• Learning to Learn

8. What are the three stages to build the hypotheses or model in machine
learning?
• Model building
• Model testing
• Applying the model

9. What is the standard approach to supervised learning?
The standard approach to supervised learning is to split the set of examples into
the training set and the test set.

10. What is ‘Training set’ and ‘Test set’?
In various areas of information science like machine learning, a set of data
used to discover a potentially predictive relationship is known as the ‘Training set’.
The training set is the set of examples given to the learner, while the test set is used
to test the accuracy of the hypotheses generated by the learner, and it is the set of
examples held back from the learner. The training set is distinct from the test set.

11. What is the difference between artificial intelligence and machine learning?
Designing and developing algorithms that learn behaviours from
empirical data is known as machine learning. Artificial intelligence, in
addition to machine learning, also covers other aspects like knowledge
representation, natural language processing, planning, robotics etc.

12. What are the advantages of Naive Bayes?

When its conditional independence assumption holds, a Naïve Bayes classifier converges
more quickly than discriminative models like logistic regression, so it needs less
training data. Its main disadvantage is that it cannot learn interactions between features.

13. What is the main key difference between supervised and unsupervised
machine learning?
Supervised learning: The supervised learning technique needs labelled data to train
the model. For example, to solve a classification problem (a supervised learning task),
you need labelled data to train the model and to classify the data into your labelled groups.
Unsupervised learning: Unsupervised learning does not need any labelled dataset. This is
the main key difference between supervised learning and unsupervised learning.


14. What is a Linear Regression?


In simple terms, linear regression models the relationship between a dependent
variable (scalar response) and one or more independent variables (explanatory
variables) with a linear function. With one explanatory variable, it is called
simple linear regression. With more than one independent variable, it is called
multiple linear regression.

15. What are the disadvantages of the linear regression model?


One of the most significant demerits of the linear model is that it is sensitive
to outliers, which can distort the overall result. Another notable demerit of the
linear model is overfitting, for example when there are many features relative to
the number of samples. Similarly, underfitting is a significant disadvantage when
the true relationship is not linear.

16. What is the difference between classification and regression?
Classification is used to produce discrete results: it assigns data to specific
categories, for example classifying emails into spam and non-spam categories.
Regression analysis is used when we are dealing with continuous data, for example
predicting stock prices at a certain point in time.

17. What is the difference between stochastic gradient descent (SGD) and
gradient descent (GD)?
Both algorithms are methods for finding a set of parameters that minimize a
loss function by evaluating parameters against data and then making adjustments.
In standard gradient descent, you'll evaluate all training samples for each set of
parameters. This is akin to taking big, slow steps toward the solution. In stochastic
gradient descent, you'll evaluate only 1 training sample for the set of parameters
before updating them. This is akin to taking small, quick steps toward the solution.
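
The contrast between the two update rules can be sketched on a one-parameter model y = w * x. The toy data, the learning rate and the function names below are assumptions chosen for illustration only.

```python
import random

data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]  # toy data generated with w = 3

def gradient(w, samples):
    """Gradient of the mean squared error of y = w * x over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)

def batch_gd(w=0.0, lr=0.01, epochs=200):
    for _ in range(epochs):
        w -= lr * gradient(w, data)  # one update per pass over ALL samples
    return w

def sgd(w=0.0, lr=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        for sample in rng.sample(data, len(data)):  # visit samples in random order
            w -= lr * gradient(w, [sample])         # one update per SINGLE sample
    return w

print(batch_gd(), sgd())  # both approach w = 3
```

In the same number of epochs, batch gradient descent makes one update per epoch while SGD makes one update per sample.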

18. What are the different types of least squares?


Least squares problems fall into two categories: linear or ordinary least
squares and nonlinear least squares, depending on whether or not the residuals are
linear in all unknowns. The linear least-squares problem occurs in statistical
regression analysis; it has a closed-form solution.
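
For a single explanatory variable, the closed-form solution can be written out directly. The sketch below assumes the simple model y = slope * x + intercept; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                      # minimises the sum of squared errors
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```

No iteration is needed, which is what "closed-form" means here.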

19. What is the difference between least squares regression and multiple
regression?
The goal of multiple linear regression is to model the linear relationship
between the explanatory (independent) variables and response (dependent)
variables. In essence, multiple regression is the extension of ordinary least-squares
(OLS) regression because it involves more than one explanatory variable.

20. What is the principle of least squares?


The "Principle of Least Squares" states that the most probable values of a system of
unknown quantities upon which observations have been made are obtained by
making the sum of the squares of the errors a minimum.

21. What are some advantages to using Bayesian linear regression?


Bayesian regression is not an algorithm but a different approach to
statistical inference. The major advantage is that, by this Bayesian processing, you
recover the whole range of inferential solutions (the full posterior distribution),
rather than a point estimate and a confidence interval as in classical regression.

22. What Is Bayesian Linear Regression?


In Bayesian linear regression, the mean of one variable is characterized by a
weighted sum of other variables. This type of conditional modeling aims to
determine the posterior distribution of the regression coefficients, as well as other
parameters describing the distribution of the regressand, and eventually permits
out-of-sample forecasting of the regressand conditional on observed values of the
regressors.

23. What are the advantages of Bayesian Regression?
 Very effective when the dataset is small.
 Particularly well suited to online learning, as opposed to batch learning,
where the complete dataset is known before training begins. This is because
Bayesian regression can update the model without having to store the data.
 The Bayesian technique has been successfully applied and is mathematically
robust, so it can be used without any additional prior knowledge of the dataset.

24. What are the disadvantages of Bayesian Regression?


 The model's inference process can take some time.
 The Bayesian strategy is not worthwhile if a lot of data is available for
our dataset; the regular frequentist approach then does the task more efficiently.

25. What are types of classification models?


 Logistic Regression
 Naive Bayes
 K-Nearest Neighbors
 Decision Tree
 Support Vector Machines

26. Why is random forest better than SVM?


Random Forest is intrinsically suited for multiclass problems, while SVM is
intrinsically two-class. For a multiclass problem, an SVM needs the problem reduced
to multiple binary classification problems. Random Forest also works well with a
mixture of numerical and categorical features.

27. Which is better linear regression or random forest?


Neither is better in general. Multiple linear regression is often used for prediction,
for example in neuroscience. Random forest regression is an alternative form of
regression that does not make the assumptions of linear regression. When the
relationship is approximately linear, linear regression can be superior to random
forest regression.

28. Which is better linear or tree based models?


If there is a highly non-linear and complex relationship between the dependent and
independent variables, a tree model will usually outperform a classical regression
method. If you need to build a model which is easy to explain to people, a decision
tree model is often a better choice than a linear model.

29. Is linear discriminant analysis classification or regression?


Linear Discriminant Analysis is a simple and effective method for
classification. Because it is simple and so well understood, there are many
extensions and variations to the method.

30. What is probabilistic discriminative model?

Discriminative models are a class of supervised machine learning models
which make predictions by directly estimating the conditional probability P(y|x).
A generative model, by contrast, requires more unknowns to be estimated: the
probability of each class, P(y), and the probability of an observation given the
class, P(x|y).

31. What is SVM?
It is a supervised learning algorithm used for both classification and regression
problems. A type of discriminative modelling, the support vector machine (SVM)
creates a decision boundary to segregate n-dimensional space into classes. The
best decision boundary is a hyperplane chosen to maximize the margin to the nearest
points of each class; these extreme points are called the support vectors.
32. What is Decision tree?
A decision tree is a type of supervised machine learning model in which data is
repeatedly split according to certain parameters. It has two main entities: decision
nodes and leaves. The leaves are the final outcomes or decisions, while the decision
nodes are the points where the data is split.

33. What is Random forest?


It is a flexible and easy-to-use machine learning algorithm that often gives good
results even without hyper-parameter tuning. Because of its simplicity and
diversity, it is one of the most used algorithms for both classification and
regression tasks.

34. What is Decision Tree Classification?
A decision tree builds classification (or regression) models as a tree structure,
with datasets broken up into ever-smaller subsets while developing the decision
tree, literally in a tree-like way with branches and nodes. Decision trees can
handle both categorical and numerical data.

35. What Is Pruning in Decision Trees, and How Is It Done?


Pruning is a technique in machine learning that reduces the size of decision
trees. It reduces the complexity of the final classifier, and hence improves
predictive accuracy by the reduction of overfitting.
Pruning can occur in:
● Top-down fashion. It traverses nodes and trims subtrees starting at the root
● Bottom-up fashion. It begins at the leaf nodes
There is a popular pruning algorithm called reduced error pruning, in which:
● Starting at the leaves, each node is replaced with its most popular class
● If the prediction accuracy is not affected, the change is kept
Reduced error pruning has the advantage of simplicity and speed.

36. Do you think 50 small decision trees are better than a large one? Why?
Yes, usually, because a random forest is an ensemble method that combines many weak
decision trees to make a strong learner. Random forests are typically more accurate,
more robust, and less prone to overfitting than a single large tree.

37. You’ve built a random forest model with 10000 trees. You got delighted after
getting training error as 0.00. But, the validation error is 34.23. What is going
on? Haven’t you trained your model perfectly?

The model has overfitted. A training error of 0.00 means the classifier has
mimicked the training data patterns to such an extent that those patterns are not
available in the unseen data. Hence, when this classifier was run on an unseen sample,
it couldn’t find those patterns and returned predictions with higher error. In a random
forest, this usually happens when the individual trees are grown too deep; adding more
trees does not by itself increase overfitting. To avoid this situation, we should tune
hyperparameters such as the maximum tree depth, the minimum samples per leaf and the
number of trees using cross-validation.

38. When would you use random forests vs SVM and why?
There are a couple of reasons why a random forest can be a better choice of
model than a support vector machine:
● Random forests allow you to determine the feature importance. SVMs do not
provide this directly.
● Random forests are much quicker and simpler to build than an SVM.
● For multi-class classification problems, SVMs require a one-vs-rest method,
which is less scalable and more memory intensive.

Part – B
1. Assume a disease so rare that it is seen in only one person out of every
million. Assume also that we have a test that is effective in that if a person has
the disease, there is a 99 percent chance that the test result will be positive;
however, the test is not perfect, and there is a one in a thousand chance that
the test result will be positive on a healthy person. Assume that a new patient
arrives and the test result is positive. What is the probability that the patient
has the disease?
2. Explain Naïve Bayes Classifier with an Example.

3. Explain SVM Algorithm in Detail.


4. Explain Decision Tree Classification.
5. Explain the principle of the gradient descent algorithm. Accompany your
explanation with a diagram. Explain the use of all the terms and constants
that you introduce and comment on the range of values that they can take.
6. Explain the following
a) Linear regression
b) Logistic Regression
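
As a hint for question 1 above, the arithmetic of Bayes' rule can be checked numerically. The function and parameter names in this sketch are illustrative assumptions; the three numbers come from the question itself.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) by Bayes' rule."""
    # Total probability of a positive result: true positives + false positives.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p = posterior(prior=1e-6, sensitivity=0.99, false_positive_rate=1e-3)
print(f"{p:.6f}")  # 0.000989, i.e. roughly 0.1 percent
```

Even with a positive result the probability stays low, because healthy people vastly outnumber patients and so false positives dominate.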


UNIT IV ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING


Combining multiple learners: Model combination schemes, Voting, Ensemble Learning -
bagging, boosting, stacking, Unsupervised learning: K-means, Instance Based Learning:
KNN, Gaussian mixture models and Expectation maximization

PART - A
1. What is bagging and boosting in ensemble learning?
Bagging is a way to decrease the variance in the prediction by generating additional
training sets from the original dataset using sampling with replacement (bootstrap
samples), training a model on each, and aggregating their predictions. Boosting is an
iterative technique which adjusts the weight of an observation based on the last
classification, so that misclassified observations receive more weight in the next round.

2. What is stacking in ensemble learning?
Stacking is one of the most popular ensemble machine learning techniques; it combines
the predictions of multiple models to build a new model and improve model performance.
Stacking enables us to train multiple models to solve similar problems, and based on
their combined output, it builds a new model (a meta-learner) with improved performance.
3. Which are the three types of ensemble learning?
The three main classes of ensemble learning methods are bagging, stacking, and
boosting. It is important both to have a detailed understanding of each method and to
consider them on your predictive modeling project.

4. Why ensemble methods are used?


There are two main reasons to use an ensemble over a single model, and they are
related:
Performance: An ensemble can make better predictions and achieve better
performance than any single contributing model.
Robustness: An ensemble reduces the spread or dispersion of the predictions and
model performance.

5. What is a voting classifier?


A voting classifier is a machine learning estimator that trains various base models or
estimators and predicts on the basis of aggregating the findings of each base estimator. The
aggregation criterion can be a combined voting decision over the estimator outputs, for
example a majority vote (hard voting) or an average of predicted probabilities (soft voting).
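
The aggregation step of a voting classifier can be sketched without any library. The helper names `hard_vote` and `soft_vote` and the probabilities below are assumptions for illustration; the base estimators themselves are left out.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label predicted by most base estimators."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities, weights=None):
    """Average per-class probabilities (optionally weighted) and pick the best class."""
    weights = weights or [1.0] * len(probabilities)
    classes = probabilities[0].keys()
    avg = {c: sum(w * p[c] for w, p in zip(weights, probabilities)) / sum(weights)
           for c in classes}
    return max(avg, key=avg.get)

probs = [{"spam": 0.9, "ham": 0.1}, {"spam": 0.4, "ham": 0.6}, {"spam": 0.45, "ham": 0.55}]
print(hard_vote(["spam", "ham", "ham"]))   # ham: two of three estimators say ham
print(soft_vote(probs))                    # spam: the confident estimator dominates
print(soft_vote(probs, weights=[1, 3, 3])) # ham: weights favour the last two estimators
```

The `weights` argument gives a simple form of weighted voting, where more trusted estimators count for more.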

6. What type of classifiers are used in weighted voting method?


Any classifiers can be combined by weighted voting. One example is a performance-
weighted voting model that integrates five classifiers: logistic regression, SVM,
random forest, XGBoost and neural networks. In that approach, cross-validation is
first used to get the predicted results for the five classifiers.

7. What is difference between K means and Gaussian mixture?


K-Means is a simple and fast clustering method that assigns each point to exactly one
cluster, but it may not truly capture the heterogeneity inherent in complex data such as
cloud workloads. Gaussian Mixture Models assign points to clusters probabilistically; they
can discover complex patterns and group them into cohesive, homogeneous components that
are close representatives of real patterns within the data set.

8. What are Gaussian mixture models How is expectation maximization used in it?
Expectation maximization provides an iterative solution to maximum
likelihood estimation with latent variables. Gaussian mixture models are an approach
to density estimation where the parameters of the distributions are fit using the
expectation-maximization algorithm.

9. What is k-means unsupervised learning?
K-Means clustering is an unsupervised learning algorithm. There is no labeled
data for this clustering, unlike in supervised learning. K-Means performs the division
of objects into clusters that share similarities and are dissimilar to the objects
belonging to another cluster. The term 'K' is the number of clusters.

10. What is the difference between K-means and KNN?


KNN is a supervised learning algorithm mainly used for classification
problems, whereas K-Means (aka K-means clustering) is an unsupervised learning
algorithm. K in K-Means refers to the number of clusters, whereas K in KNN is the
number of nearest neighbors (based on the chosen distance metric).
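
The role of K in KNN can be seen in a minimal classifier sketch. Euclidean distance, the function name `knn_predict` and the toy points are assumptions chosen for illustration.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Labelled training points: KNN is supervised, so every point carries a class.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]

print(knn_predict(train, (1.1, 1.0)))  # A
print(knn_predict(train, (5.1, 5.0)))  # B
```

K-Means, by contrast, would receive the same points without the labels and would have to discover the two groups itself.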

11. What is expectation maximization algorithm used for?
The EM algorithm is used to find (local) maximum likelihood parameters of a
statistical model in cases where the equations cannot be solved directly. Typically
these models involve latent variables in addition to unknown parameters and known
data observations.
12. What is the advantage of Gaussian process?
Gaussian processes are a powerful method for both regression and
classification. Their greatest practical advantage is that they can give a reliable
estimate of their own uncertainty.

13. What are examples of unsupervised learning?


Some examples of unsupervised learning algorithms include K-Means
Clustering, Principal Component Analysis and Hierarchical Clustering.

14. How do you implement expectation maximization algorithm?


The two steps of the EM algorithm are:
E-step: perform probabilistic assignments of each data point to some class
based on the current hypothesis h for the distributional class parameters;
M-step: update the hypothesis h for the distributional class parameters based
on the new data assignments.
The two steps are repeated until the parameters converge.
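
The two steps can be sketched for a one-dimensional mixture of Gaussians. This is a minimal illustration under stated assumptions: the function names, the fixed iteration count, the starting parameters and the variance floor of 1e-6 are choices made here, not part of the notes.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=50):
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: probabilistic assignment (responsibility) of each point to each component.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
    return means, variances, weights

data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 5.1, 4.8]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, weights)  # means near 1.0 and 5.0, weights near 0.5 each
```

EM only finds a local maximum of the likelihood, so the result depends on the starting parameters.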

15. What is the principle of maximum likelihood?


The principle of maximum likelihood is a method of obtaining the optimum
values of the parameters that define a model, by choosing the values that maximize
the likelihood of the observed data. In doing so, you increase the
likelihood of your model reaching the “true” model.

Part – B
1. Explain briefly the structure of unsupervised learning.
2. Explain the various learning techniques involved in unsupervised learning.
3. What is a Gaussian process? Explain Gaussian parameter estimation in detail
with suitable examples.
4. Explain the concepts of clustering approaches. How do they differ from classification?
5. List the applications of clustering and identify the advantages and disadvantages of
clustering algorithms.
6. Explain the EM algorithm.


7. List non-parametric techniques and Explain K-nearest neighbour estimation.

UNIT V NEURAL NETWORKS


Perceptron - Multilayer perceptron, activation functions, network training – gradient
descent optimization – stochastic gradient descent, error backpropagation, from shallow
networks to deep networks –Unit saturation (aka the vanishing gradient problem) – ReLU,
hyperparameter tuning, batch normalization, regularization, dropout.

1. What is perceptron and its types?


A Perceptron is an Artificial Neuron. It is the simplest possible Neural Network. Neural
Networks are the building blocks of Machine Learning.

2. Which activation function is used in multilayer perceptron?
The sigmoid activation function is the one classically used in multilayer perceptron
neural networks.

3. What are the activation functions of MLP?
In MLP and CNN neural network models, ReLU is the default activation function for
hidden layers. In RNN neural network models, we use the sigmoid or tanh function for
hidden layers; of the two, tanh usually performs better. Only the identity activation
function is considered linear.
4. Does MLP have activation function?
Yes. Multilayer perceptrons (MLPs) have proven very successful in many applications,
including classification. The activation function is the source of the MLP's power:
without a non-linear activation, the stacked layers collapse to a single linear map. Careful
selection of the activation function has a huge impact on the network performance.

5. What is the difference between a perceptron and a MLP?


The Perceptron was only capable of handling linearly separable data, hence the multi-layer
perceptron was introduced to overcome this limitation. An MLP is a neural network capable
of handling both linearly separable and non-linearly separable data.
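
The limitation can be demonstrated with the classic perceptron learning rule. In this sketch the function names, the learning rate and the epoch count are illustrative assumptions: AND is linearly separable and is learned, while XOR is not.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule with a hard-limit (step) output."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - output          # zero when the sample is classified correctly
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w, b = train_perceptron(AND)
print(all(predict(w, b, x) == t for x, t in AND))  # True: AND is linearly separable
w, b = train_perceptron(XOR)
print(all(predict(w, b, x) == t for x, t in XOR))  # False: no single line separates XOR
```

An MLP with one hidden layer and a non-linear activation can represent XOR, which is the motivation given above.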

6. What are the types of activation function?


Popular types of activation functions are:
 Binary Step Function
 Linear Function
 Sigmoid
 Tanh
 ReLU
 Leaky ReLU
 Parameterised ReLU
 Exponential Linear Unit

7. What is MLP and how does it work?


A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a
set of outputs from a set of inputs. An MLP is characterized by several layers of nodes
connected as a directed graph between the input and output layers. MLP uses
backpropagation for training the network.


8. Why do you require Multilayer Perceptron?


MLPs are useful in research for their ability to solve problems stochastically, which often
allows approximate solutions for extremely complex problems like fitness approximation.

9. What are the advantages of Multilayer Perceptron?


Advantages of Multi-Layer Perceptron:
A multi-layered perceptron model can be used to solve complex non-linear problems.
It works well with both small and large input data.
It helps us to obtain quick predictions after the training.
It helps to obtain the same accuracy ratio with large as well as small data.

10. What do you mean by activation function?

An activation function is a function used in artificial neural networks which outputs a
small value for small inputs, and a larger value if its inputs exceed a threshold. If the inputs
are large enough, the activation function "fires", otherwise it does nothing.

11. What are the limitations of perceptron?
Perceptron networks have several limitations. First, the output values of a perceptron can
take on only one of two values (0 or 1) because of the hard-limit transfer function. Second,
perceptrons can only classify linearly separable sets of vectors.
12. How many layers are there in perceptron?
A basic perceptron has a single layer of weights. A network with one hidden layer is
known as a two-layer perceptron: it consists of two layers of neurons. The first
layer is known as the hidden layer, and the second layer, known as the output layer,
consists of a single neuron.

13. Is stochastic gradient descent the same as gradient descent?

No. Compared to Gradient Descent, Stochastic Gradient Descent is much faster per update,
and more suitable for large-scale datasets. But since the gradient is not computed for the
entire dataset, and only for one random point on each iteration, the updates have a higher
variance.

14. How is stochastic gradient descent used as an optimization technique?


Stochastic gradient descent is an optimization algorithm widely used in machine learning
applications to find the model parameters that correspond to the best fit between predicted
and actual outputs. It's an inexact but powerful technique.

15. Does stochastic gradient descent lead to faster training?


Gradient Descent is the most common optimization algorithm and the foundation of how
we train an ML model. But it can be really slow for large datasets. That's why we use a
variant of this algorithm known as Stochastic Gradient Descent to make our model learn a lot
faster.

16. What is stochastic gradient descent and why is it used in the training of neural
networks?
Stochastic Gradient Descent is an optimization algorithm that can be used to train neural
network models. The Stochastic Gradient Descent algorithm requires gradients to be
calculated for each variable in the model so that new values for the variables can be
calculated.


17. What are the three main types of gradient descent algorithm?
There are three types of gradient descent learning algorithms: batch gradient descent,
stochastic gradient descent and mini-batch gradient descent.

18. What are the disadvantages of stochastic gradient descent?


SGD is much faster but the convergence path of SGD is noisier than that of original
gradient descent. This is because in each step it is not calculating the actual gradient but an
approximation. So we see a lot of fluctuations in the cost.

19. How do you solve the vanishing gradient problem within a deep neural network?

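The vanishing gradient problem is caused by the derivative of the activation function used
in the neural network: backpropagation multiplies these derivatives layer by layer, so
derivatives smaller than 1 (the sigmoid's is at most 0.25) shrink the gradient exponentially
with depth. The simplest solution to the problem is to replace the activation function of the
network. Instead of sigmoid, use an activation function such as ReLU.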
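
The effect can be illustrated by multiplying activation derivatives along one path of the backward pass. This sketch makes simplifying assumptions: all weights are taken as 1, every layer has the same pre-activation value of 0.5, and the helper names are illustrative.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # never larger than 0.25

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0  # exactly 1 for active units

def gradient_scale(derivative, pre_activations):
    """Product of activation derivatives along one backward path (weights taken as 1)."""
    scale = 1.0
    for z in pre_activations:
        scale *= derivative(z)
    return scale

layers = [0.5] * 20  # 20 layers, each with pre-activation 0.5
print(gradient_scale(sigmoid_derivative, layers))  # tiny: the gradient has vanished
print(gradient_scale(relu_derivative, layers))     # 1.0: the gradient passes through
```

With sigmoid the factor shrinks at every layer, while an active ReLU unit passes the gradient through unchanged.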

20. What is the problem with ReLU?

Key among the limitations of ReLU is the case where large weight updates can mean that
the summed input to the activation function is always negative, regardless of the input to the
network. This means that a node with this problem will forever output an activation value of
0.0. This is referred to as a “dying ReLU”.

21. Why is ReLU used in deep learning?
The ReLU function is another non-linear activation function that has gained popularity in
the deep learning domain. ReLU stands for Rectified Linear Unit. The main advantage of
using the ReLU function over other activation functions is that it does not activate all the
neurons at the same time.

22. Why is ReLU better than Softmax?


The two serve different purposes, and the activation function is chosen according to the
requirement. Generally, we use ReLU in the hidden layers to avoid the vanishing gradient
problem and for better computational performance, and we use the Softmax function in the
final output layer.

Part – B
1. Draw the architecture of a single layer perceptron (SLP) and explain its
operation. Mention its advantages and disadvantages.
2. Draw the architecture of a Multilayer perceptron (MLP) and explain its
operation. Mention its advantages and disadvantages.
3. Explain the stochastic optimization methods for weight determination.
4. Describe back propagation and features of back propagation.
5. Write the flowchart of error back-propagation training algorithm.
6. Develop a Back propagation algorithm for Multilayer Feed forward neural
network consisting of one input layer, one hidden layer and output layer from
first principles.
7. List the factors that affect the performance of multilayer feed-forward neural
network.
8. Difference between a Shallow Net & Deep Learning Net.
9. How do you tune hyperparameters for better neural network performance?
Explain in detail.


UNIT – 1 (8 Marks and 16 Marks)


1. Is AI a science, or is it engineering? Or neither or both?
Explain. (AU-MAY 2012).

Human vs. Machine


Everyone knows that humans and machines are different. Machines are the creation of humans, and
they were created to make their work easier.

 Humans depend more and more on machines for their day-to-day things. Machines have
created a revolution, and no human can think of a life without machines.

 A machine is only a device consisting of different parts, and is used for performing different
functions. They do not have life, as they are mechanical. On the other hand, humans are made
of flesh and blood; life is just not mechanical for humans.

 Humans have feelings and emotions, and they can express these emotions; happiness and
sorrow are part of one’s life. On the other hand, machines have no feelings and emotions.
They just work as per the details fed into their mechanical brain.

 Humans have the capability to understand situations, and behave accordingly. On the
contrary, Machines do not have this capability.

 While humans behave according to their consciousness, machines perform as they are taught.
Humans perform activities as per their own intelligence. On the contrary, machines only have
an artificial intelligence.

 It is a man-made intelligence that the machines have. The brilliance of the intelligence of a
machine depends on the intelligence of the humans that created it.

 Another striking difference that can be seen is that humans can do anything original, and
machines cannot. Machines have limitations to their performance because they need humans
to guide them.

Downloaded from STUCOR APP




2. Explain the schematic of AI’S agent performing action. (Dec-09, 12, 14, May-12).

3. Explain the role of an agent program. (Dec-09,16)


Following diagram illustrates the agent’s action process, as specified by architecture. This can
also be termed as agent’s structure.

Fig: agent’s action process


 An agent function is implemented internally by the agent program.
 An agent program takes the current percept from the sensors as input and returns an
action to the effectors (actuators).

4. Discuss any 2 uninformed search methods with examples. (Dec-2009); Explain
the following uninformed search strategies: 1) IDDFS and 2) Bidirectional
search. (May 2010); What is uninformed search? Explain depth first search
with example. (May-2013, Dec 2013, May 2014, May 2015, Dec 2016)
Uninformed search is a class of general-purpose search algorithms. These
algorithms are brute force operations, and they don’t have extra information about the search space;
the only information they have is on how to traverse or visit the nodes in the tree. Thus uninformed
search algorithms are also called blind search algorithms. The search algorithm produces the search
tree without using any domain knowledge, which makes it brute force in nature. They differ from
informed search algorithms in that they can only check whether a node is a goal when it is generated
or expanded; they don’t have any background information on how to approach the goal.

TYPES OF UNINFORMED SEARCH ALGORITHMS


 Breadth-First Search Algorithms
BFS is a search operation for finding the nodes in a tree. The algorithm works breadth-wise: it
starts the search from the root node and expands all the successor nodes at the current level
before moving on to the next level.

 It occupies a lot of memory space, and takes a long time to execute when the solution is at the
bottom or end of the tree. It uses a FIFO queue.
 Time Complexity of BFS is expressed as T(b) = 1 + b + b^2 + ... + b^d = O(b^d), where b is the
branching factor and d is the depth of the shallowest solution.
 Space Complexity of BFS is O(b^d).
 The breadth-first search algorithm is complete (for a finite branching factor).
 BFS finds the optimal solution when all step costs are equal.
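
A breadth-first search with a FIFO queue can be sketched as follows. The adjacency-dictionary representation and the example tree are assumptions made for illustration.

```python
from collections import deque

def bfs(graph, start, goal):
    """Return the shallowest path from start to goal, or None if there is none."""
    frontier = deque([[start]])        # FIFO queue of paths
    visited = {start}
    while frontier:
        path = frontier.popleft()      # oldest (shallowest) path first
        node = path[-1]
        if node == goal:
            return path
        for child in graph.get(node, []):
            if child not in visited:
                visited.add(child)
                frontier.append(path + [child])
    return None

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
print(bfs(tree, "A", "G"))  # ['A', 'C', 'G']
```

Storing whole paths in the queue keeps the sketch short; a parent map would use less memory.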

 Depth First Search Algorithms


DFS is one of the recursive algorithms we know. It traverses the graph or a tree depth-wise. Thus it is
known to be a depth-first search algorithm as it derives its name from the way it functions. The DFS
uses the stack for its implementation. The process of search is similar to BFS. The only difference lies
in the expansion of nodes which is depth-wise in this case.


 Unlike BFS, DFS requires very little memory, because it stores on the stack only the
nodes on the path it is currently exploring depth-wise.
 In comparison to BFS, the execution time is also less if the search happens to expand nodes
along the correct path. If the path is not correct, then the recursion continues, and there is no
guarantee that one may find the solution. In an infinite state space this may result in an
infinite loop.
 The DFS is complete only with finite state space.
 Time Complexity is expressed as T(b) = 1 + b + b^2 + ... + b^m = O(b^m), where m is the
maximum depth of the search tree.
 The Space Complexity is expressed as O(b×m).
 The DFS search algorithm is not optimal, and it may take a large number of steps and incur a
high cost to find the solution.

 Depth Limited Search Algorithm


The DLS algorithm is one of the uninformed strategies. Depth-limited search is DFS with a
predetermined depth limit ℓ, which addresses the demerit of DFS on infinite paths. Nodes at the
depth limit are treated as if they have no successors. Depth-limited search can be halted in two cases:

o SFV: The standard failure value indicates that there is no solution to the problem.
o CFV: The cutoff failure value indicates that there is no solution within the given depth limit.

 The DLS is efficient in memory space utilization.


 Time Complexity is expressed as O(b^ℓ).
 Space Complexity is expressed as O(b×ℓ).
 It has the demerit of incompleteness. It is complete only if the solution is at or above the
depth limit.
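
The two halting cases can be made explicit in code. In this sketch the constants `CUTOFF` and `FAILURE` stand for the cutoff and standard failure values; the names and the example tree are illustrative assumptions.

```python
CUTOFF, FAILURE = "cutoff", "failure"

def depth_limited_search(graph, node, goal, limit):
    """Recursive DFS that treats nodes at the depth limit as having no successors."""
    if node == goal:
        return [node]
    if limit == 0:
        return CUTOFF                       # depth limit reached on this branch
    cutoff_occurred = False
    for child in graph.get(node, []):
        result = depth_limited_search(graph, child, goal, limit - 1)
        if result == CUTOFF:
            cutoff_occurred = True
        elif result != FAILURE:
            return [node] + result          # goal found below this node
    return CUTOFF if cutoff_occurred else FAILURE

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
print(depth_limited_search(tree, "A", "G", 2))  # ['A', 'C', 'G']
print(depth_limited_search(tree, "A", "G", 1))  # cutoff: the goal lies below the limit
print(depth_limited_search(tree, "A", "Z", 5))  # failure: the goal is not in the tree
```

A cutoff result means a deeper limit might still succeed, whereas failure means the whole tree was searched.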


 Uniform-cost Search Algorithm


The UCS algorithm is used for searching a weighted tree or graph. The main goal of uniform-cost
search is to reach a goal node by the path with the lowest cumulative cost. The following are the
properties of the UCS algorithm:

 Nodes are expanded in order of their path cost from the root. The UCS is implemented using a
priority queue.
 The UCS does not care about the number of steps, only about their total cost, so it may end up
in an infinite loop if there are zero-cost actions.
 The uniform-cost search algorithm is complete, provided every step cost is at least some small
positive constant ε.
 Time Complexity can be expressed as O(b^(1 + ⌊C*/ε⌋)), where C* is the cost of the optimal
solution.
 Space Complexity is expressed as O(b^(1 + ⌊C*/ε⌋)).
 UCS is optimal, as it always expands the path with the lowest cost first.
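
Uniform-cost search with a priority queue can be sketched as follows. The weighted graph, its node names and its costs are made up for illustration.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Return (cost, path) of the cheapest path from start to goal, or None."""
    frontier = [(0, start, [start])]        # priority queue ordered by path cost
    best_cost = {start: 0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path               # first goal popped is the cheapest
        if cost > best_cost.get(node, float("inf")):
            continue                        # stale entry: a cheaper route was found
        for child, step_cost in graph.get(node, []):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(child, float("inf")):
                best_cost[child] = new_cost
                heapq.heappush(frontier, (new_cost, child, path + [child]))
    return None

graph = {"S": [("A", 1), ("B", 5)], "A": [("B", 2), ("G", 9)], "B": [("G", 3)]}
print(uniform_cost_search(graph, "S", "G"))  # (6, ['S', 'A', 'B', 'G'])
```

The cheapest path uses three steps although a two-step path exists, which shows that UCS ranks paths by cost rather than by number of steps.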


 Iterative deepening depth-first Search


This algorithm combines the BFS and DFS search techniques. It runs a depth-limited DFS repeatedly, starting with a small depth limit and increasing the limit on each iteration until the goal node is found. In this way it finds the best depth limit.

 It combines the low memory use of DFS with the completeness of BFS.
 It is useful when the search space is large and the depth of the solution is not known.
 Its one demerit is that each iteration repeats all the work of the previous iterations.
 The algorithm is complete if the branching factor is finite.
 Time Complexity is expressed as O(b^d).
 Space Complexity is expressed as O(b×d).
 This algorithm is optimal when the path cost is a non-decreasing function of node depth.
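
A minimal sketch of the idea, assuming a tree stored as a dictionary of child lists; the tree, the max_depth safety bound and the function names are hypothetical choices for the illustration.

```python
def depth_limited(graph, node, goal, limit):
    """DFS that never goes deeper than `limit`; returns a path or None."""
    if node == goal:
        return [node]
    if limit == 0:
        return None
    for child in graph.get(node, []):
        path = depth_limited(graph, child, goal, limit - 1)
        if path is not None:
            return [node] + path
    return None


def iterative_deepening(graph, start, goal, max_depth=20):
    """Try depth limits 0, 1, 2, ... and return (depth, path) of the first hit."""
    for limit in range(max_depth + 1):
        path = depth_limited(graph, start, goal, limit)
        if path is not None:
            return limit, path
    return None


# Hypothetical binary tree of depth 2.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
        "D": [], "E": [], "F": [], "G": []}
```

Each iteration restarts from the root, which is the repeated work mentioned above, but only the current path is kept in memory.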

 Bidirectional Search Algorithm


The two-way or bidirectional search algorithm runs two searches simultaneously, one forward from the initial state and the other backward from the goal node. The search stops when the two searches intersect. Either search is free to use any of the algorithms discussed above, such as BFS or DFS.

 Bidirectional search is quick and occupies less memory.
 The implementation is difficult, and the goal node must be known in advance.
 The bidirectional search algorithm is complete and optimal when BFS is used in both directions and all step costs are equal.
 Time Complexity is expressed as O(b^(d/2)).
 Space Complexity is expressed as O(b^(d/2)).
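
The sketch below runs BFS from both ends and stops when the two frontiers meet. It is an illustration under assumptions: the graph is undirected and given as adjacency lists, the names are hypothetical, and the function only promises a connecting path (it makes no attempt to prove that the first meeting point gives the shortest one).

```python
from collections import deque


def bidirectional_search(graph, start, goal):
    """Return a path from start to goal, or None if none exists."""
    if start == goal:
        return [start]
    parents_f = {start: None}  # parents in the forward search
    parents_b = {goal: None}   # parents in the backward search
    queue_f, queue_b = deque([start]), deque([goal])

    def expand_level(queue, parents, other_parents):
        # Expand one full BFS level; return the meeting node if found.
        for _ in range(len(queue)):
            node = queue.popleft()
            for neighbour in graph.get(node, []):
                if neighbour not in parents:
                    parents[neighbour] = node
                    if neighbour in other_parents:
                        return neighbour
                    queue.append(neighbour)
        return None

    while queue_f and queue_b:
        meet = expand_level(queue_f, parents_f, parents_b)
        if meet is None:
            meet = expand_level(queue_b, parents_b, parents_f)
        if meet is not None:
            path, node = [], meet
            while node is not None:          # walk back to start
                path.append(node)
                node = parents_f[node]
            path.reverse()
            node = parents_b[meet]
            while node is not None:          # walk forward to goal
                path.append(node)
                node = parents_b[node]
            return path
    return None


# Hypothetical undirected chain A - B - C - D - E, plus an isolated node Z.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D"], "Z": []}
```

On the chain, the forward search reaches C from A and the backward search reaches C from E, so the two half-paths are joined at C.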


5. What is game and applications of game theory? (May-03)


Game:
The term game means a sort of conflict in which n individuals or groups participate. Game theory denotes games of strategy. Games are an integral attribute of human beings and engage the intellectual faculties of humans. If computers are to mimic people, they should be able to play games.

Applications of Game Theory

The following are just a few examples of game theory applications:


 Stock trades and the investors’ reactions and decisions against stock market developments and
the behaviors and decisions of other investors
 OPEC member countries’ decision to change the amount of oil extraction and sale and their
compliance or non-compliance with quota arrangements
 Corporate behavior regarding product pricing in monopoly or multilateral competition
markets
 Animal interaction with one another in social life (hunting or sharing achievements or
supporting each other)

6. Explain the Mini-Max algorithm and how it works for the game tic-tac-toe. (Dec-03, 04, May-09, 10, 17, 19)
 Mini-max algorithm is a recursive or backtracking algorithm which is used in decision-making and game theory. It provides an optimal move for the player, assuming that the opponent is also playing optimally.

 Mini-Max algorithm uses recursion to search through the game-tree.


 Min-Max algorithm is mostly used for game playing in AI, such as Chess, Checkers, tic-tac-toe, Go, and various two-player games. The algorithm computes the minimax decision for the current state.

 In this algorithm two players play the game; one is called MAX and the other is called MIN.

 Each player plays so that the opponent gets the minimum benefit while they themselves get the maximum benefit.

 The two players are opponents: MAX selects the maximized value and MIN selects the minimized value.

 The minimax algorithm performs a depth-first search to explore the complete game tree.

 The minimax algorithm proceeds all the way down to the terminal nodes of the tree, then backs the values up the tree as the recursion unwinds.

Pseudo-code for MinMax Algorithm:

function minimax(node, depth, maximizingPlayer) is
    if depth == 0 or node is a terminal node then
        return static evaluation of node

    if maximizingPlayer then                  // for Maximizer player
        maxEva = -infinity
        for each child of node do
            eva = minimax(child, depth-1, false)
            maxEva = max(maxEva, eva)         // gives maximum of the values
        return maxEva

    else                                      // for Minimizer player
        minEva = +infinity
        for each child of node do
            eva = minimax(child, depth-1, true)
            minEva = min(minEva, eva)         // gives minimum of the values
        return minEva

Working of Min-Max Algorithm:


o The working of the minimax algorithm can be easily described using an example. Below we
have taken an example of game-tree which is representing the two-player game.


o In this example, there are two players; one is called Maximizer and the other is called Minimizer.
o Maximizer will try to get the maximum possible score, and Minimizer will try to get the minimum possible score.
o This algorithm applies DFS, so in this game-tree we have to go all the way down to the leaves to reach the terminal nodes.
o At the terminal nodes, the terminal values are given, so we compare those values and backtrack up the tree until we reach the initial state. Following are the main steps involved in solving the two-player game tree:

Step 1: In the first step, the algorithm generates the entire game-tree and applies the utility function to get the utility values for the terminal states. In the tree diagram, let A be the initial state of the tree. Suppose the maximizer takes the first turn, with worst-case initial value = -∞, and the minimizer takes the next turn, with worst-case initial value = +∞.

Step 2: First we find the utility values for the Maximizer. Its initial value is -∞, so we compare each terminal value with the Maximizer's initial value and keep the higher one. This finds the maximum among them all.
o For node D: max(-1, -∞) => max(-1, 4) = 4
o For node E: max(2, -∞) => max(2, 6) = 6
o For node F: max(-3, -∞) => max(-3, -5) = -3
o For node G: max(0, -∞) => max(0, 7) = 7


Step 3: Next it is the minimizer's turn, so it compares all node values with +∞ and finds the values of nodes B and C.
o For node B: min(4, 6) = 4
o For node C: min(-3, 7) = -3

Step 4: Now it is the Maximizer's turn again. It chooses the maximum of all node values to find the value of the root node. In this game tree there are only 4 layers, so we reach the root node immediately, but in real games there will be more than 4 layers.
o For node A: max(4, -3) = 4
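
Steps 1 to 4 can be checked with a few lines of code. This is a simplified sketch of the pseudo-code above, assuming the game tree is written as nested lists whose leaves are the terminal values from the example; the depth argument is left out because the whole tree is searched.

```python
def minimax(node, maximizing_player):
    """Back up terminal values: MAX takes the maximum, MIN the minimum."""
    if isinstance(node, (int, float)):  # terminal node: return its utility
        return node
    values = [minimax(child, not maximizing_player) for child in node]
    return max(values) if maximizing_player else min(values)


# A -> (B, C); B -> (D, E); C -> (F, G); leaves as in Step 2.
D, E, F, G = [-1, 4], [2, 6], [-3, -5], [0, 7]
B, C = [D, E], [F, G]
A = [B, C]
```

Calling minimax(A, True) reproduces the root value 4 from Step 4, and minimax(B, False) and minimax(C, False) reproduce the values 4 and -3 from Step 3.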


Properties of Mini-Max algorithm:
o Complete- Min-Max algorithm is complete. It will definitely find a solution (if one exists) in a finite search tree.
o Optimal- Min-Max algorithm is optimal if both opponents are playing optimally.
o Time complexity- As it performs DFS on the game-tree, the time complexity of Min-Max algorithm is O(b^m), where b is the branching factor of the game-tree and m is the maximum depth of the tree.
o Space Complexity- Space complexity of Mini-max algorithm is similar to DFS, which is O(b×m).

Limitation of the minimax Algorithm:


The main drawback of the minimax algorithm is that it gets really slow for complex games such as Chess, Go, etc. These games have a huge branching factor, and the player has lots of choices to decide among. This limitation of the minimax algorithm can be mitigated with alpha-beta pruning.

Describing Minimax
The key to the Minimax algorithm is a back and forth between the two players, where the player whose "turn it is" desires to pick the move with the maximum score. In turn, the scores for each of the available moves are determined by the opposing player deciding which of its available moves has the minimum score. And the scores for the opposing player's moves are again determined by the turn-taking player trying to maximize its score, and so on all the way down the move tree to an end state.

A description for the algorithm, assuming X is the "turn taking player," would look
something like:

 If the game is over, return the score from X's perspective.


 Otherwise get a list of new game states for every possible move
 Create a scores list
 For each of these states add the minimax result of that state to the scores list
 If it's X's turn, return the maximum score from the scores list
 If it's O's turn, return the minimum score from the scores list


You'll notice that this algorithm is recursive; it flips back and forth between the
players until a final score is found. Let’s walk through the algorithm's execution with
the full move tree, and show why, algorithmically, the instant winning move will be
picked:

 It's X's turn in state 1. X generates the states 2, 3, and 4 and calls minimax on
those states.
 State 2 pushes the score of +10 to state 1's score list, because the game is in an
end state.
 State 3 and 4 are not in end states, so 3 generates states 5 and 6 and calls
minimax on them, while state 4 generates states 7 and 8 and calls minimax on
them.
 State 5 pushes a score of -10 onto state 3's score list, while the same happens
for state 7 which pushes a score of -10 onto state 4's score list.
 State 6 and 8 generate the only available moves, which are end states, and so
both of them add the score of +10 to the move lists of states 3 and 4.
 Because it is O's turn in both state 3 and 4, O will seek to find the minimum
score, and given the choice between -10 and +10, both states 3 and 4 will yield
-10.
 Finally, the score list for state 1 is populated with +10, -10 and -10 from states 2, 3 and 4 respectively, and state 1, seeking to maximize the score, will choose the winning move with score +10, state 2.

That is certainly a lot to take in. And that is why we have a computer execute this
algorithm.
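
The description above can be sketched for tic-tac-toe as follows. The board encoding (a 9-character string with "-" for an empty cell), the helper names and the sample position are assumptions made for this illustration; scores are +10 for an X win, -10 for an O win and 0 for a draw, all from X's perspective.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
EMPTY = "-"


def winner(board):
    """Return "X" or "O" if that player has three in a line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != EMPTY and board[a] == board[b] == board[c]:
            return board[a]
    return None


def other(player):
    return "O" if player == "X" else "X"


def minimax(board, player):
    """Score of `board` from X's perspective, with `player` to move."""
    won = winner(board)
    if won == "X":
        return 10
    if won == "O":
        return -10
    if EMPTY not in board:
        return 0  # draw
    scores = []
    for i, cell in enumerate(board):
        if cell == EMPTY:
            child = board[:i] + player + board[i + 1:]
            scores.append(minimax(child, other(player)))
    return max(scores) if player == "X" else min(scores)


def best_move(board, player):
    """Index of a move with the best minimax score for `player`."""
    moves = [i for i, cell in enumerate(board) if cell == EMPTY]

    def score(i):
        return minimax(board[:i] + player + board[i + 1:], other(player))

    return max(moves, key=score) if player == "X" else min(moves, key=score)
```

For the hypothetical position "XX-OO----" with X to move, best_move picks cell 2, which completes the top row; this mirrors the way state 1 chooses the immediately winning state 2 in the walkthrough.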


7. Explain Alpha-Beta Pruning using example. (Dec—04,10, May-10,17)

Alpha-Beta Pruning
o Alpha-beta pruning is a modified version of the minimax algorithm. It is an optimization technique for the minimax algorithm.
o As we have seen in the minimax search algorithm, the number of game states it has to examine is exponential in the depth of the tree. We cannot eliminate the exponent, but we can effectively cut it in half. There is a technique by which we can compute the correct minimax decision without checking each node of the game tree, and this technique is called pruning. It involves two threshold parameters, alpha and beta, for future expansion, so it is called alpha-beta pruning. It is also called the Alpha-Beta Algorithm.
o Alpha-beta pruning can be applied at any depth of a tree, and sometimes it prunes not only the tree leaves but also entire sub-trees.
o The two parameters can be defined as:

a. Alpha: The best (highest-value) choice we have found so far at any point along the path of Maximizer. The initial value of alpha is -∞.
b. Beta: The best (lowest-value) choice we have found so far at any point along the path of Minimizer. The initial value of beta is +∞.
o Alpha-beta pruning applied to a standard minimax algorithm returns the same move as the standard algorithm does, but it removes all the nodes which do not affect the final decision and only make the algorithm slow. Pruning these nodes makes the algorithm fast.

Condition for Alpha-beta pruning:


The main condition required for alpha-beta pruning is: α >= β

Key points about alpha-beta pruning:

o The Max player will only update the value of alpha.
o The Min player will only update the value of beta.
o While backtracking the tree, the node values will be passed to upper nodes instead of the values of alpha and beta.
o We will only pass the alpha, beta values to the child nodes.

Working of Alpha-Beta Pruning:

Let's take an example of a two-player search tree to understand the working of alpha-beta pruning.

Step 1: In the first step, the Max player starts the first move from node A, where α = -∞ and β = +∞. These values of alpha and beta are passed down to node B, where again α = -∞ and β = +∞, and node B passes the same values to its child D.


Step 2: At node D, the value of α is calculated, as it is Max's turn. The value of α is compared first with 2 and then with 3, so max(2, 3) = 3 becomes the value of α at node D, and the node value is also 3.

Step 3: The algorithm now backtracks to node B, where the value of β changes, as this is Min's turn. β = +∞ is compared with the available successor node value, i.e. min(∞, 3) = 3, hence at node B now α = -∞ and β = 3. Next, the algorithm traverses the next successor of node B, which is node E, and the values α = -∞ and β = 3 are passed down to it.

Step 4: At node E, Max takes its turn, and the value of alpha changes. The current value of alpha is compared with 5, so max(-∞, 5) = 5, hence at node E α = 5 and β = 3. Since α >= β, the right successor of E is pruned and the algorithm does not traverse it. The value at node E is 5.

Step 5: The algorithm again backtracks, from node B to node A. At node A, the value of alpha changes to the maximum available value, 3, as max(-∞, 3) = 3, and β = +∞. These two values are now passed to the right successor of A, which is node C.
At node C, α = 3 and β = +∞, and the same values are passed on to node F.

Step 6: At node F, the value of α is compared with the left child, which is 0, and max(3, 0) = 3, and then with the right child, which is 1, and max(3, 1) = 3. So α remains 3, but the node value of F becomes 1.

Step 7: Node F returns the node value 1 to node C. At C, α = 3 and β = +∞; here the value of beta changes, as it is compared with 1, so min(∞, 1) = 1. Now at C, α = 3 and β = 1, which again satisfies the condition α >= β, so the next child of C, which is G, is pruned, and the algorithm does not compute the entire sub-tree G.

Step 8: C now returns the value 1 to A, where the best value for A is max(3, 1) = 3. The final game tree shows which nodes were computed and which were never computed. Hence the optimal value for the maximizer is 3 for this example.
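
The eight steps can be replayed with the sketch below. The tree is written as nested lists; the leaves 2, 3, 5, 0 and 1 come from the walkthrough, while the second leaf of E (9) and the two leaves of G (7 and 5) are assumed values, since those nodes are pruned and the text does not state them. The visited list is an extra added only to show which leaves are actually evaluated.

```python
import math


def alphabeta(node, alpha, beta, maximizing, visited):
    """Minimax value of `node` with alpha-beta pruning."""
    if isinstance(node, (int, float)):
        visited.append(node)  # record every leaf that is really evaluated
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, visited))
            alpha = max(alpha, value)   # only Max updates alpha
            if alpha >= beta:
                break                   # prune the remaining children
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True, visited))
        beta = min(beta, value)         # only Min updates beta
        if alpha >= beta:
            break                       # prune the remaining children
    return value


# A -> (B, C); B -> (D, E); C -> (F, G).
tree = [[[2, 3], [5, 9]], [[0, 1], [7, 5]]]
visited = []
result = alphabeta(tree, -math.inf, math.inf, True, visited)
```

The root value is 3, as in Step 8, and only the leaves 2, 3, 5, 0 and 1 are visited: the right child of E and the whole sub-tree G are skipped, so their assumed values never influence the result.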


Move Ordering in Alpha-Beta pruning:

The effectiveness of alpha-beta pruning is highly dependent on the order in which each node is
examined. Move order is an important aspect of alpha-beta pruning. It can be of two types:

o Worst ordering: In some cases, the alpha-beta pruning algorithm does not prune any of the leaves of the tree and works exactly like the minimax algorithm. In this case it also consumes more time because of the alpha-beta bookkeeping. Such an ordering is called worst ordering; here the best move occurs on the right side of the tree. The time complexity for such an order is O(b^m).
o Ideal ordering: The ideal ordering for alpha-beta pruning occurs when a lot of pruning happens in the tree and the best moves occur on the left side of the tree. Because DFS searches the left of the tree first, alpha-beta can go twice as deep as the minimax algorithm in the same amount of time. Complexity in ideal ordering is O(b^(m/2)).

8. Explain heuristic search with an example. Explain A* search and give the proof of optimality of A*.
Informed Search Algorithms
 An informed search algorithm uses additional knowledge such as how far we are from the goal, the path cost, how to reach the goal node, etc. This knowledge helps agents explore less of the search space and find the goal node more efficiently.
 The informed search algorithm is more useful for large search spaces. It uses the idea of a heuristic, so it is also called heuristic search.
Heuristic function: A heuristic is a function used in informed search to find the most promising path. It takes the current state of the agent as its input and produces an estimate of how close the agent is to the goal. The heuristic method might not always give the best solution, but it aims to find a good solution in reasonable time. The heuristic function is represented by h(n), and it estimates the cost of an optimal path from state n to the goal. The value of the heuristic function is never negative.
Admissibility of the heuristic function is given as:
h(n) <= h*(n)
Here h(n) is the heuristic (estimated) cost, and h*(n) is the actual cost of an optimal path from n to the goal. Hence an admissible heuristic never overestimates the true cost.

Pure Heuristic Search:


Pure heuristic search is the simplest form of heuristic search algorithm. It expands nodes based on their heuristic value h(n). It maintains two lists, OPEN and CLOSED. The CLOSED list holds the nodes which have already been expanded, and the OPEN list holds the nodes which have not yet been expanded. On each iteration, the node n with the lowest heuristic value is expanded, all its successors are generated, and n is placed in the CLOSED list. The algorithm continues until a goal state is found.


In the informed search we will discuss two main algorithms which are given below:

o Best First Search Algorithm(Greedy search)


o A* Search Algorithm

 Best-first Search Algorithm (Greedy Search):


Greedy best-first search algorithm always selects the path which appears best at that moment. It combines aspects of depth-first search and breadth-first search, guided by a heuristic function, which allows us to take advantage of both algorithms. With the help of best-first search, at each step we can choose the most promising node. In the best-first search algorithm, we expand the node which appears closest to the goal node, where the closeness is estimated by the heuristic function, i.e.
f(n) = h(n)
where h(n) = estimated cost from node n to the goal.
The greedy best-first algorithm is implemented using a priority queue.

Best first search algorithm:

o Step 1: Place the starting node into the OPEN list.
o Step 2: If the OPEN list is empty, stop and return failure.
o Step 3: Remove the node n with the lowest value of h(n) from the OPEN list and place it in the CLOSED list.
o Step 4: Expand the node n, and generate the successors of node n.
o Step 5: Check each successor of node n to see whether it is a goal node. If any successor is a goal node, return success and terminate the search; otherwise proceed to Step 6.
o Step 6: For each successor node, compute the evaluation function f(n), and then check whether the node is already in the OPEN or CLOSED list. If it is in neither list, add it to the OPEN list.
o Step 7: Return to Step 2.

Advantages:
o Best first search can switch between BFS and DFS by gaining the advantages of both the
algorithms.
o This algorithm is more efficient than BFS and DFS algorithms.

Disadvantages:
o It can behave as an unguided depth-first search in the worst case scenario.
o It can get stuck in a loop as DFS.
o This algorithm is not optimal.

Example:


o Consider the search problem below, which we will traverse using greedy best-first search. At each iteration, the node with the lowest value of the evaluation function f(n) = h(n), given in the table below, is expanded.

In this search example, we are using two lists which are OPEN and CLOSED Lists. Following are
the iteration for traversing the above example.

Expand the nodes of S and put in the CLOSED list


Initialization: Open [A, B], Closed [S]
Iteration 1: Open [A], Closed [S, B]


Iteration 2: Open [E, F, A], Closed [S, B]
: Open [E, A], Closed [S, B, F]

Iteration 3: Open [I, G, E, A], Closed [S, B, F]


: Open [I, E, A], Closed [S, B, F, G]

Hence the final solution path will be: S----> B----->F----> G
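
The traversal can be reproduced with a priority queue ordered by h(n). The edges below follow the OPEN/CLOSED iterations listed above (S expands to A and B, B to E and F, F to I and G); the heuristic values are illustrative assumptions, because the table they come from is not reproduced here, and the function name is hypothetical. For brevity the sketch tests for the goal when a node is removed from OPEN rather than when it is generated.

```python
import heapq


def greedy_best_first(graph, h, start, goal):
    """Always expand the OPEN node with the lowest h(n); return the path found."""
    open_list = [(h[start], start, [start])]
    closed = set()
    while open_list:
        _, node, path = heapq.heappop(open_list)
        if node == goal:
            return path
        if node in closed:
            continue
        closed.add(node)
        for child in graph.get(node, []):
            if child not in closed:
                heapq.heappush(open_list, (h[child], child, path + [child]))
    return None


graph = {"S": ["A", "B"], "B": ["E", "F"], "F": ["I", "G"]}
# Assumed heuristic values, chosen so that B, then F, then G look most promising.
h = {"S": 13, "A": 12, "B": 4, "E": 8, "F": 2, "I": 9, "G": 0}
```

With these values the search returns S, B, F, G, the same path as the trace above.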

Time Complexity: The worst-case time complexity of greedy best-first search is O(b^m).
Space Complexity: The worst-case space complexity of greedy best-first search is O(b^m), where m is the maximum depth of the search space.

Complete: Greedy best-first search is incomplete, even if the given state space is finite.
Optimal: Greedy best-first search algorithm is not optimal.

2.) A* Search Algorithm:


A* search is the most commonly known form of best-first search. It uses the heuristic function h(n) and the cost g(n) to reach node n from the start state. It combines features of UCS and greedy best-first search, which lets it solve the problem efficiently. A* search finds the shortest path through the search space using the heuristic function. It expands a smaller search tree and provides an optimal result faster. A* is similar to UCS except that it uses g(n) + h(n) instead of g(n). In A* search we use the search heuristic as well as the cost to reach the node. Hence we can combine both costs as follows, and this sum is called the fitness number:
f(n) = g(n) + h(n)

Complexity: The worst-case time complexity of greedy best-first search is O(b^m).

Algorithm of A* search:
Step 1: Place the starting node in the OPEN list.
Step 2: Check whether the OPEN list is empty; if it is, return failure and stop.
Step 3: Select the node n from the OPEN list which has the smallest value of the evaluation function (g+h). If node n is the goal node, return success and stop; otherwise
Step 4: Expand node n, generate all of its successors, and put n into the CLOSED list. For each successor n', check whether n' is already in the OPEN or CLOSED list; if not, compute the evaluation function for n' and place it into the OPEN list.


Step 5: Else, if node n' is already in OPEN or CLOSED, it should be attached to the back pointer which reflects the lowest g(n') value.
Step 6: Return to Step 2.
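
The steps above can be sketched with a priority queue ordered by g + h. This is an illustration, not a definitive implementation: the graph, its edge costs, the heuristic values and the function name a_star are assumptions made for the example, and the best_g dictionary plays the role of the back-pointer bookkeeping in Step 5 by keeping only the cheapest known route to each node.

```python
import heapq


def a_star(graph, h, start, goal):
    """Return (cost, path) of the path found, or None if goal is unreachable."""
    open_list = [(h[start], 0, start, [start])]  # entries are (f, g, node, path)
    best_g = {start: 0}                          # lowest g found for each node
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # a cheaper route to this node was found already
        for child, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(child, float("inf")):
                best_g[child] = new_g
                heapq.heappush(open_list,
                               (new_g + h[child], new_g, child, path + [child]))
    return None


# Hypothetical weighted graph: node -> list of (child, step cost).
graph = {
    "S": [("A", 1), ("G", 10)],
    "A": [("B", 2), ("C", 1)],
    "B": [("D", 5)],
    "C": [("D", 3), ("G", 4)],
    "D": [("G", 2)],
    "G": [],
}
# Assumed heuristic; it never overestimates the remaining cost in this graph.
h = {"S": 4, "A": 3, "B": 4, "C": 2, "D": 2, "G": 0}
```

In this graph the direct edge S to G costs 10, but A* returns S, A, C, G with cost 6, because f = g + h ranks that route first.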

Advantages:
o A* search often performs better than other search algorithms.

o A* search algorithm is optimal and complete.
o This algorithm can solve very complex problems.
Disadvantages:
o It does not always produce the shortest path when the heuristic is not admissible, as it relies on heuristics and approximation.
o A* search algorithm has some complexity issues.
o The main drawback of A* is its memory requirement: it keeps all generated nodes in memory, so it is not practical for various large-scale problems.

Points to remember:
o A* algorithm returns the path which occurred first, and it does not search for all remaining paths.
o The efficiency of A* algorithm depends on the quality of the heuristic.
o A* algorithm expands all nodes which satisfy the condition f(n) < C*, where C* is the cost of the optimal solution.

Complete: A* algorithm is complete as long as:


o Branching factor is finite.
o Cost at every action is fixed.

Optimal: A* search algorithm is optimal if it follows below two conditions:


o Admissible: The first condition required for optimality is that h(n) should be an admissible heuristic for A* tree search. An admissible heuristic is optimistic in nature.
o Consistency: The second required condition is consistency, which applies only to A* graph search.
If the heuristic function is admissible, then A* tree search will always find the least-cost path.

Time Complexity: The time complexity of A* search depends on the heuristic function; in the worst case the number of nodes expanded is exponential in the depth of the solution d. So the time complexity is O(b^d), where b is the branching factor.

Space Complexity: The space complexity of A* search algorithm is O(b^d).

9. Explain the approaches for solving tree-structured constraint satisfaction problems with suitable examples.


Constraint Satisfaction Problems in Artificial Intelligence

This section examines the constraint satisfaction technique, another form of problem-solving method. As the name suggests, constraint satisfaction means that a problem must be solved while adhering to a set of restrictions or rules. A problem is said to be solved by the constraint satisfaction technique when the values of its variables satisfy all the constraints of the problem. Such a technique leads to a deeper understanding of the structure and complexity of the problem.

Constraint satisfaction depends on three components, namely:

o X: It is a set of variables.
o D: It is a set of domains in which the variables reside. Every variable has its own domain.
o C: It is a set of constraints that the set of variables must satisfy.
In constraint satisfaction, domains are the spaces in which the variables reside, subject to the problem-specific constraints. These three components make up a constraint satisfaction problem in its entirety. Each constraint consists of a pair {scope, rel}. The scope is a tuple of the variables that participate in the constraint, and rel is a relation that lists the values the variables may take in order to satisfy the constraints of the problem.

Solving Constraint Satisfaction Problems

To solve a constraint satisfaction problem (CSP), the following are required:

o A state space
o The notion of a solution.
A state in the state space is defined by assigning values to some or all of the variables, such as X1 = v1, X2 = v2, etc.

An assignment of values to variables can be of three kinds:

1. Consistent or Legal Assignment: An assignment is called consistent or legal if it does not violate any constraint.
2. Complete Assignment: An assignment in which every variable is assigned a value. A complete assignment that is also consistent is a solution to the CSP.
3. Partial Assignment: An assignment that gives values to only some of the variables.
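
The components X, D and C and the three kinds of assignment can be illustrated with a tiny map-colouring problem. Everything here is an assumption made for the sketch: the three regions, the colours, the helper names, and the choice to write each rel as a Python predicate instead of an explicit list of allowed value tuples. The solver is plain backtracking search.

```python
variables = ["WA", "NT", "SA"]                              # X: set of variables
domains = {v: ["red", "green", "blue"] for v in variables}  # D: one domain each


def different(a, b):
    return a != b


# C: each constraint is a (scope, rel) pair.
constraints = [(("WA", "NT"), different),
               (("WA", "SA"), different),
               (("NT", "SA"), different)]


def is_consistent(assignment):
    """True if no constraint whose scope is fully assigned is violated."""
    for scope, rel in constraints:
        if all(v in assignment for v in scope):
            if not rel(*(assignment[v] for v in scope)):
                return False
    return True


def is_complete(assignment):
    """True if every variable has a value."""
    return all(v in assignment for v in variables)


def backtracking_search(assignment=None):
    """Extend a partial assignment one variable at a time; return a solution or None."""
    assignment = dict(assignment or {})
    if is_complete(assignment):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if is_consistent(assignment):
            result = backtracking_search(assignment)
            if result is not None:
                return result
        del assignment[var]
    return None


solution = backtracking_search()
```

Here {"WA": "red"} is a consistent partial assignment, {"WA": "red", "NT": "red"} is inconsistent, and the returned solution is both complete and consistent.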

Domain Categories within CSP


The variables use one of the two types of domains listed below:

o Discrete Domain: It is an infinite domain which can have one state for multiple variables. For instance, a start state can be allocated infinitely many times for each variable.
o Finite Domain: It is a finite domain which can have continuous states describing one domain for one specific variable. It is also called a continuous domain.


Types of Constraints in CSP

Basically, there are three different types of constraints with respect to the variables:

o Unary Constraints: The simplest type of constraint, which restricts the value of a single variable.
o Binary Constraints: These constraints relate two variables. For example, a variable x2 may be required to hold a value that lies between x1 and x3.
o Global Constraints: This type of constraint involves an arbitrary number of variables.

Some special types of solution algorithms are used to solve the following types of constraints:

o Linear Constraints: These constraints are commonly used in linear programming, where each variable (carrying an integer value) occurs only in linear form.
o Non-linear Constraints: These constraints are used in non-linear programming, where each variable (an integer value) occurs in a non-linear form.

UNIT – 2 (8 Marks and 16 Marks)


1. How to handle uncertain knowledge with example? And how to represent knowledge in an uncertain domain? (Dec-2013) and Define uncertain knowledge, prior probability and conditional probability. State Bayes' theorem. How is it useful for decision making under uncertainty? (May-2014)

PROBABILISTIC REASONING

Uncertainty:
Till now, we have learned knowledge representation using first-order logic and propositional logic with certainty, which means we were sure about the predicates. With this knowledge representation we might write A→B, which means if A is true then B is true. But consider a situation where we are not sure whether A is true or not; then we cannot express this statement. This situation is called uncertainty. So to represent uncertain knowledge, where we are not sure about the predicates, we need uncertain reasoning or probabilistic reasoning.
Causes of uncertainty:
Following are some leading causes of uncertainty to occur in the real world.
1. Information occurred from unreliable sources.
2. Experimental Errors
3. Equipment fault
4. Temperature variation
5. Climate change.


Probabilistic reasoning:
Probabilistic reasoning is a way of knowledge representation where we apply the concept of
probability to indicate the uncertainty in knowledge. In probabilistic reasoning, we combine
probability theory with logic to handle the uncertainty.
 We use probability in probabilistic reasoning because it provides a way to handle the
uncertainty that is the result of someone's laziness and ignorance.
 In the real world, there are lots of scenarios, where the certainty of something is not
confirmed, such as "It will rain today," "behavior of someone for some situations," "A match
between two teams or two players."
 These are probable sentences for which we can assume that it will happen but not sure about
it, so here we use probabilistic reasoning.

Need of probabilistic reasoning in AI:


o When there are unpredictable outcomes.
o When specifications or possibilities of predicates becomes too large to handle.
o When an unknown error occurs during an experiment.
In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:
o Bayes' rule
o Bayesian Statistics
As probabilistic reasoning uses probability and related terms, so before understanding probabilistic
reasoning, let's understand some common terms:

Probability: Probability can be defined as the chance that an uncertain event will occur. It is the numerical measure of the likelihood that an event will occur. The value of a probability always lies between 0 and 1, the two extremes representing complete certainty about the outcome.
0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
P(A) = 0 indicates that event A is impossible (it will certainly not occur).
P(A) = 1 indicates total certainty that event A will occur.
We can find the probability of an uncertain event by using the formula below.
Probability of occurrence = Number of desired outcomes / Total number of outcomes

o P(¬A) = probability that event A does not happen.
o P(¬A) + P(A) = 1.

Event: Each possible outcome of a variable is called an event.


Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and objects in the real world.
Prior probability: The prior probability of an event is probability computed before observing new
information.


Posterior Probability: The probability that is calculated after all evidence or information has taken
into account. It is a combination of prior probability and new information.

Conditional probability:
Conditional probability is the probability of an event occurring when another event has already happened.
Let's suppose we want to calculate the probability of event A when event B has already occurred, "the probability of A under the conditions of B". It can be written as:
P(A|B) = P(A⋀B) / P(B)
where P(A⋀B) = joint probability of A and B
P(B) = marginal probability of B.
If the probability of A is given and we need to find the probability of B, then it will be given as:
P(B|A) = P(A⋀B) / P(A)

It can be explained by using a Venn diagram, where B is the event that has occurred, so the sample space is reduced to set B, and we can calculate the probability of event A given that B has occurred by dividing P(A⋀B) by P(B).
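
A small worked example may help. It assumes a fair six-sided die as the sample space (this example is not part of the original notes), with A = "the roll is even" and B = "the roll is greater than 4"; the helper names prob and conditional are hypothetical.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                # one roll of a fair die
A = {x for x in sample_space if x % 2 == 0}      # even roll: {2, 4, 6}
B = {x for x in sample_space if x > 4}           # roll greater than 4: {5, 6}


def prob(event):
    """Number of desired outcomes divided by total number of outcomes."""
    return Fraction(len(event), len(sample_space))


def conditional(a, b):
    """P(a | b) = P(a and b) / P(b)."""
    return prob(a & b) / prob(b)
```

Here P(A⋀B) = 1/6 and P(B) = 2/6, so conditional(A, B) gives 1/2: once B is known, the sample space shrinks to {5, 6} and only 6 is even. In the other direction, conditional(B, A) gives (1/6) / (3/6) = 1/3.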

Bayes' theorem in Artificial intelligence

Bayes' theorem:

Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning. It determines the probability of an event with uncertain knowledge. In probability theory, it relates the conditional probability and marginal probabilities of two random events. Bayes' theorem was named after the British mathematician Thomas Bayes. Bayesian inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics. It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).

 Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.

 Example: If the risk of cancer is related to a person's age, then by using Bayes' theorem we can determine the
probability of cancer more accurately with the help of age.

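 Bayes' theorem can be derived using the product rule and the conditional probability of event A with
known event B. From the product rule we can write:

P(A⋀B) = P(A|B) P(B)

Similarly, for the probability of event B with known event A:

P(A⋀B) = P(B|A) P(A)

Equating the right-hand sides of both equations, we get:

P(A|B) = P(B|A) P(A) / P(B) ........ (a)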

The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most
modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,

 P(A|B) is known as the posterior, which we need to calculate. It is read as the probability
of hypothesis A given that evidence B has occurred.
 P(B|A) is called the likelihood: assuming that the hypothesis is true, we
calculate the probability of the evidence.
 P(A) is called the prior probability, the probability of the hypothesis before considering the
evidence.
 P(B) is called the marginal probability, the probability of the evidence alone.
In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai), hence Bayes' rule can be
written as:

P(Ai|B) = P(Ai) P(B|Ai) / Σk P(Ak) P(B|Ak)

Where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.
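
A short numerical sketch may help to see how the prior, likelihood and marginal combine. The numbers below (a disease with 1% prevalence, a test with 90% sensitivity and a 5% false-positive rate) are hypothetical values chosen only for illustration, and the function name bayes_posterior is likewise an assumption.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(Ai|B) for each hypothesis Ai.

    priors[i]      = P(Ai), for mutually exclusive and exhaustive hypotheses
    likelihoods[i] = P(B|Ai)
    """
    # Marginal probability of the evidence: P(B) = sum over i of P(Ai) * P(B|Ai)
    marginal = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / marginal for p, l in zip(priors, likelihoods)]

# Hypothetical numbers: A1 = "has the disease", A2 = "does not have the disease",
# B = "the test result is positive".
priors = [0.01, 0.99]
likelihoods = [0.90, 0.05]

posterior = bayes_posterior(priors, likelihoods)
print(round(posterior[0], 4))  # P(disease | positive test)
```

With these assumed numbers the posterior for the disease stays well below the test's sensitivity, because the prior probability of the disease is small.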

Applying Bayes' rule:

Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This is
very useful in cases where we have good estimates of these three terms and want to determine the
fourth one. Suppose we perceive the effect of some unknown cause and want to compute the probability of that
cause; then Bayes' rule becomes:

P(cause|effect) = P(effect|cause) P(cause) / P(effect)

Application of Bayes' theorem:

o It is used to calculate the next step of the robot when the already executed step is given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.

2. Discuss about Bayesian theory and Bayesian network. (Dec 2017)

Bayesian network

o "A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph."
o It is also called a Bayes network, belief network, decision network, or Bayesian model.
o Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.

Real world applications are probabilistic in nature, and to represent the relationship between multiple
events, we need a Bayesian network. It can also be used in various tasks including prediction,
anomaly detection, diagnostics, automated insight, reasoning, time series prediction,
and decision making under uncertainty.

 A Bayesian network can be used for building models from data and expert opinions, and it consists
of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional probabilities between
random variables. These directed links or arrows connect pairs of nodes in the graph.
These links represent that one node directly influences the other node, and if there is no
directed link, the nodes are independent of each other.
o In the diagram, A, B, C, and D are random variables represented by the nodes of
the network graph.
o If we are considering node B, which is connected with node A by a directed arrow,
then node A is called the parent of node B.
o Node C is independent of node A.

The Bayesian network has mainly two components:


o Causal Component
o Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which
determines the effect of the parents on that node. A Bayesian network is based on the joint probability
distribution and conditional probability. So let's first understand the joint probability distribution:

Joint probability distribution:


If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ...,
xn are known as the joint probability distribution.

P[x1, x2, x3, ..., xn] can be expanded using the product (chain) rule as follows:
P[x1, x2, x3, ..., xn] = P[x1| x2, x3, ..., xn] P[x2, x3, ..., xn]
= P[x1| x2, x3, ..., xn] P[x2| x3, ..., xn] .... P[xn-1| xn] P[xn].

In general, for each variable Xi in a Bayesian network, we can write the equation as:
P(Xi| Xi-1, ..., X1) = P(Xi| Parents(Xi))
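
As a minimal sketch of this factorization, the joint probability of a small network can be computed by multiplying each node's conditional probability given its parents. The network below (Rain and Sprinkler as parents of WetGrass) and all of its probability values are hypothetical and are used only to illustrate the idea.

```python
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler (all numbers are illustrative).
p_rain = {True: 0.2, False: 0.8}            # P(Rain), a node without parents
p_sprinkler = {True: 0.1, False: 0.9}       # P(Sprinkler), a node without parents
p_wet_true = {                              # P(WetGrass = True | Rain, Sprinkler)
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.00,
}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) as the product of P(Xi | Parents(Xi)) over all nodes."""
    p_w = p_wet_true[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (p_w if wet else 1.0 - p_w)

# The joint distribution must sum to 1 over all combinations of values.
total = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3))

# Marginal P(WetGrass = True): sum the joint over the other variables.
p_wet = sum(joint(r, s, True) for r, s in product([True, False], repeat=2))

print(round(total, 6), round(p_wet, 4))
```

Only the small conditional tables are stored; every entry of the full joint distribution is recovered from them on demand.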

3. Explain about Dempster shafer theory. (May- 2017)


Dempster – Shafer Theory (DST)

o DST is a mathematical theory of evidence based on belief functions and plausible reasoning.
It is used to combine separate pieces of information (evidence) to calculate the probability of
an event.

o DST offers an alternative to traditional probabilistic theory for the mathematical
representation of uncertainty.
o DST can be regarded as a more general approach to representing uncertainty than the Bayesian
approach. Bayesian methods are sometimes inappropriate.

Example:
Let A represent the proposition "Moore is attractive". Then the axioms of probability insist
that P(A) + P(¬A) = 1. Now suppose that Andrew does not even know who "Moore" is. Then it is not
fair to say that he believes the proposition. Also, it
is not fair to say that he disbelieves the proposition. It would therefore be meaningful to denote
Andrew's beliefs B(A) and B(¬A) as both being 0.

Dempster-Shafer Model
The idea is to allocate a number between 0 and 1 to indicate a degree of belief in a proposition, as
in the probability framework. However, it is not considered a probability but a belief mass. The
distribution of masses is called the basic belief assignment.
In other words, in this formalism a degree of belief (referred to as mass) is represented as
a belief function rather than a Bayesian probability distribution.

Example: Belief assignment

Suppose a system has five members, say five independent states, exactly one of which is actual.
If the original set is called S, |S| = 5, then the set of all subsets (the power set) is called 2^S. If each
possible subset is written as a binary vector (describing whether each member is present or not by writing 1 or 0),
then 2^5 = 32 subsets are possible, ranging from the empty subset (0, 0, 0, 0, 0) to the "everything" subset
(1, 1, 1, 1, 1).

The "empty" subset represents a "contradiction", which is not true in any state, and is thus assigned a
mass of zero; the remaining masses are normalized so that their total is 1. The "everything" subset is
labeled as "unknown"; it represents the state where all elements are present, in the sense that you
cannot tell which is actual.

Belief and Plausibility

Shafer's framework allows for belief about propositions to be represented as intervals, bounded by
two values, belief (or support) and plausibility:
belief ≤ plausibility
Belief in a hypothesis is constituted by the sum of the masses of all sets enclosed by it (i.e. the sum of
the masses of all subsets of the hypothesis). It is the amount of belief that directly supports a given
hypothesis at least in part, forming a lower bound.

Plausibility is 1 minus the sum of the masses of all sets whose intersection with the hypothesis is
empty. It is an upper bound on the possibility that the hypothesis could be true, because there is
only so much evidence that contradicts that hypothesis.

Example:
Consider the proposition "the cat in the box is dead." Suppose we have a belief of 0.5 and a plausibility of
0.8 for the proposition. This means that
we have evidence that allows us to state strongly that the proposition is true with a confidence of 0.5.
However, the evidence contrary to that hypothesis (i.e. "the cat is alive") only has a confidence of 0.2.
The remaining mass of 0.3 (the gap between the 0.5 supporting evidence on the one hand, and the 0.2
contrary evidence on the other) is "indeterminate," meaning that the cat could either be dead or alive.
This interval represents the level of uncertainty based on the evidence in the system.

The mass of the "Neither" hypothesis is set to zero by definition (it corresponds to "no solution"). The
orthogonal hypotheses "Alive" and "Dead" have masses of 0.2 and 0.5, respectively.
This could correspond to "Live/Dead Cat Detector" signals, which have respective reliabilities
of 0.2 and 0.5.

Finally, the all-encompassing "Either" hypothesis (which simply acknowledges there is a cat in
the box) picks up the slack so that the sum of the masses is 1. The belief for the "Alive" and
"Dead" hypotheses matches their corresponding masses because they have no subsets; belief
for "Either" consists of the sum of all three masses (Either, Alive, and Dead) because "Alive"
and "Dead" are each subsets of "Either".

The "Alive" plausibility is 1 − m(Dead) = 0.5 and the "Dead" plausibility is 1 − m(Alive) = 0.8.
Equivalently, the "Alive" plausibility is m(Alive) + m(Either) and the "Dead" plausibility
is m(Dead) + m(Either).

Finally, the "Either" plausibility sums m(Alive) + m(Dead) + m(Either). The universal
hypothesis ("Either") will always have 100% belief and plausibility—it acts as a checksum of
sorts.
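
The belief and plausibility values of the cat example can be recomputed directly from the masses. The sketch below uses the masses stated in the text (Alive 0.2, Dead 0.5, Either 0.3); the representation of hypotheses as Python frozensets and the function names are assumptions made for illustration.

```python
# Hypotheses are represented as sets of states; masses are taken from the example above.
ALIVE = frozenset({"alive"})
DEAD = frozenset({"dead"})
EITHER = ALIVE | DEAD

mass = {ALIVE: 0.2, DEAD: 0.5, EITHER: 0.3}   # the empty set has mass 0 and is omitted

def belief(hypothesis):
    """Sum of the masses of all sets that are subsets of the hypothesis."""
    return sum(m for s, m in mass.items() if s <= hypothesis)

def plausibility(hypothesis):
    """Sum of the masses of all sets that intersect the hypothesis."""
    return sum(m for s, m in mass.items() if s & hypothesis)

for name, h in [("Alive", ALIVE), ("Dead", DEAD), ("Either", EITHER)]:
    print(name, round(belief(h), 2), round(plausibility(h), 2))
```

For every hypothesis the belief is at most the plausibility, and the width of the interval between them is the mass that remains uncommitted.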

Plausibility of K: It is the sum of the masses of all sets that intersect K,

i.e., Pl(K) = m(a) + m(b) + m(c) + m(a, b) + m(b, c) + m(a, c) + m(a, b, c)

Characteristics of Dempster Shafer Theory:


 It has an ignorance part such that the probabilities of all events aggregate to 1.
 Ignorance is reduced in this theory by adding more and more evidence.
 A combination rule is used to combine various types of possibilities.

Advantages:
 As we add more information, the uncertainty interval reduces.
 DST has a much lower level of ignorance.
 Diagnostic hierarchies can be represented using this.
 A person dealing with such problems is free to think about evidence.

Disadvantages:
 The computation effort is high, as we have to deal with 2^n sets.

4. Explain about the exact inference in Bayesian networks. (May- 2015)

UNIT – 3 (8 Marks and 16 Marks)

1. Explain the types of learning in machine learning.


Types of Machine Learning
Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions. Machine learning contains a set
of algorithms that work on a huge amount of data. Data is fed to these algorithms to train them, and
on the basis of training, they build the model & perform a specific task.

These ML algorithms help to solve different business problems like Regression, Classification,
Forecasting, Clustering, and Associations, etc.
Based on the methods and way of learning, machine learning is divided into mainly four types, which
are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

Supervised Machine Learning


Supervised machine learning is based on supervision. It means that in the supervised learning
technique, we train the machines using a "labelled" dataset, and based on the training, the machine
predicts the output. Here, labelled data means that some of the inputs are already mapped to the
output. More precisely, we first train the machine with the input and corresponding
output, and then we ask the machine to predict the output using the test dataset.

Example: Suppose we have an input dataset of cat and dog images. First, we train
the machine to understand the images using features such as the shape and size of the tail,
the shape of the eyes, colour, and height (dogs are taller, cats are smaller). After completion of training,
we input the picture of a cat and ask the machine to identify the object and predict the output. Now,
the machine is well trained, so it will check all the features of the object, such as height, shape, colour,
eyes, ears, tail, etc., and find that it's a cat. So, it will put it in the Cat category. This is the process of
how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable(x) with the output
variable(y). Some real-world applications of supervised learning are Risk Assessment, Fraud
Detection, Spam filtering, etc.
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression

a) Classification
Classification algorithms are used to solve classification problems in which the output variable is
categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. The classification algorithms
predict the categories present in the dataset. Some real-world examples of classification algorithms
are Spam Detection, Email filtering, etc.
Some popular classification algorithms are given below:
o Random Forest Algorithm
o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm

b) Regression
Regression algorithms are used to solve regression problems, in which the relationship
between input and output variables is modelled (in the simplest case as a linear relationship). These
are used to predict continuous output variables, such as market trends, weather prediction, etc.

Some popular Regression algorithms are given below:


o Simple Linear Regression Algorithm
o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression

Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea about
the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:
o These algorithms are not able to solve complex tasks.
o It may predict the wrong output if the test data is different from the training data.
o It requires lots of computational time to train the algorithm.

Applications of Supervised Learning


Some common applications of Supervised Learning are given below:
o Image Segmentation
o Medical Diagnosis
o Fraud Detection
o Spam Detection
o Speech Recognition

2. Unsupervised Machine Learning


Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning, the machine is
trained using the unlabeled dataset, and the machine predicts the output without any supervision. In
unsupervised learning, the models are trained with the data that is neither classified nor labelled, and
the model acts on that data without any supervision.

Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association

Advantages:
o These algorithms can be used for more complicated tasks than the supervised ones because
they work on unlabeled datasets.
o Unsupervised algorithms are preferable for various tasks, as getting an unlabeled dataset is
easier than getting a labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and
the algorithms are not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as it works with an unlabelled dataset
that is not mapped to an output.

Applications of Unsupervised Learning


o Network Analysis
o Recommendation Systems

2. Explain in detail about regression models - linear regression models.

Linear Regression in Machine Learning


 Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

 The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or
more independent (x) variables, hence it is called linear regression. Since linear regression
shows the linear relationship, it finds how the value of the dependent variable
changes according to the value of the independent variable.

 The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:

y = a0 + a1x + ε
Here,
y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value)
ε = Random error

The values for x and y variables are training datasets for Linear Regression model representation.

Types of Linear Regression


Linear regression can be further divided into two types of the algorithm:

o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.

o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

Linear Regression Line


A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:

o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the
X-axis, then such a relationship is termed a positive linear relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the
X-axis, then such a relationship is called a negative linear relationship.

Finding the best fit line:


When working with linear regression, our main goal is to find the best fit line, which means the error
between predicted values and actual values should be minimized. The best fit line will have the least
error. Different values for the weights or coefficients of the line (a0, a1) give different regression lines,
so we need to calculate the best values for a0 and a1 to find the best fit line. To calculate
this, we use a cost function.
Cost function-
o The cost function is used to estimate the values of the coefficients for the best fit
line.
o The cost function is used to optimize the regression coefficients or weights. It measures how well a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as the hypothesis
function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average
of the squared errors between the predicted values and actual values. It can be written as:

MSE = (1/N) Σ (Yi − (a1xi + a0))², summed over i = 1, ..., N

Where,
N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value.

Residuals: The distance between the actual value and the predicted value is called the residual. If the
observed points are far from the regression line, then the residuals will be high, and so the cost function
will be high. If the scatter points are close to the regression line, then the residuals will be small and
hence the cost function will be low.

Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
o It is done by randomly selecting initial values of the coefficients and then iteratively updating the values
to reach the minimum of the cost function.
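
The update loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: the data points, the learning rate, the number of steps, and the fixed starting point (0, 0), used instead of a random one so that the run is reproducible, are all chosen for the example.

```python
def mse(a0, a1, xs, ys):
    """Mean squared error of the line y = a0 + a1*x on the data."""
    n = len(xs)
    return sum((y - (a1 * x + a0)) ** 2 for x, y in zip(xs, ys)) / n

def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit the intercept a0 and slope a1 by gradient descent on the MSE."""
    a0, a1 = 0.0, 0.0   # assumed fixed start (the text describes a random start)
    n = len(xs)
    for _ in range(steps):
        # Prediction errors for the current coefficients.
        errors = [(a1 * x + a0) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to a0 and a1.
        grad_a0 = 2.0 / n * sum(errors)
        grad_a1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Move against the gradient to reduce the cost.
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1

# Illustrative data generated from y = 2x + 1 without noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

a0, a1 = gradient_descent(xs, ys)
print(round(a0, 3), round(a1, 3), round(mse(a0, a1, xs, ys), 6))
```

If the learning rate is set too large, the updates overshoot and the cost grows instead of shrinking, so in practice it has to be tuned for the data.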

Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of
finding the best model out of various models is called optimization.

3. Explain in detail about regression models - linear classification models.
Logistic Regression in Machine Learning
o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how they are used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
o In Logistic Regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, whose output is bounded by two limiting values (0 and 1).

o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image
is showing the logistic function:

Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or
the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides whether the predicted
probability is mapped to class 0 or class 1: values above the threshold are assigned to 1, and values below the
threshold are assigned to 0.
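
The mapping from a real value to a probability and then to a class can be sketched as follows. The coefficients a0 = -4 and a1 = 2 and the threshold of 0.5 are hypothetical values picked for illustration; they are not fitted to any data.

```python
import math

def sigmoid(z):
    """Map any real value z to a value between 0 and 1."""
    # Two branches avoid overflow in math.exp for large negative or positive z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(x, a0, a1, threshold=0.5):
    """Return the predicted probability and the class label (0 or 1)."""
    p = sigmoid(a0 + a1 * x)
    return p, int(p >= threshold)

# Hypothetical coefficients, chosen only to illustrate the threshold rule.
a0, a1 = -4.0, 2.0
for x in (1.0, 2.0, 3.0):
    p, label = predict(x, a0, a1)
    print(x, round(p, 3), label)
```

At x = 2 the linear part a0 + a1x is exactly 0, so the probability is 0.5; this is where the predicted class switches with the default threshold.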

Assumptions for Logistic Regression:


o The dependent variable must be categorical in nature.
o The independent variables should not exhibit multicollinearity.

Logistic Regression Equation:


The Logistic Regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of the straight line can be written as:

y = a0 + a1x1 + a2x2 + ... + anxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide y by (1-y):

y / (1-y), which is 0 for y = 0 and infinity for y = 1

o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the ratio y / (1-y),
and the equation becomes:

log[y / (1-y)] = a0 + a1x1 + a2x2 + ... + anxn

The above equation is the final equation for Logistic Regression.


Type of Logistic Regression:
Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

4. Explain in detail about support vector machine?


Support Vector Machine Algorithm
 Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.

 The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.

 SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using a
decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that can
accurately identify whether it is a cat or a dog, such a model can be created by using the SVM
algorithm. We will first train our model with lots of images of cats and dogs so that it can learn about
the different features of cats and dogs, and then we test it with this strange creature. The support vector
machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support
vectors), so it will look at the extreme cases of cat and dog when classifying the new creature.

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be
classified into two classes by using a single straight line, then such data is termed linearly
separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a
dataset cannot be classified by using a straight line, then such data is termed non-linear data,
and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane:

 There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find the best decision boundary that helps to classify the data points.
This best boundary is known as the hyperplane of SVM.

 The dimension of the hyperplane depends on the number of features present in the dataset, which means
if there are 2 features (as shown in the image), then the hyperplane will be a straight line. And if there
are 3 features, then the hyperplane will be a 2-dimensional plane.

Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the
hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called
support vectors.

How does SVM work?

Linear SVM:
The working of the SVM algorithm can be understood by using
an example. Suppose we have a dataset that has two tags (green
and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair (x1, x2) of coordinates as either
green or blue. Consider the below image:
The SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called a hyperplane. The SVM algorithm finds the points of both
classes that lie closest to the boundary. These points are called support vectors. The distance between
the support vectors and the hyperplane
is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with
maximum margin is called the optimal hyperplane.
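
The notions of margin and support vectors can be made concrete with a small calculation. The sketch below does not train an SVM; it only measures distances to one candidate boundary. The green and blue points and the line x1 + x2 = 5.5 are hypothetical values assumed for illustration.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from a point to the hyperplane w·x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    return (dot + b) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical 2-D data with two features (x1, x2) and two tags.
green = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
blue = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]

# Candidate boundary x1 + x2 - 5.5 = 0 (assumed for this example, not learned).
w, b = (1.0, 1.0), -5.5

points = green + blue
distances = [abs(signed_distance(p, w, b)) for p in points]

# The margin is the distance from the boundary to the closest point(s);
# those closest points play the role of the support vectors.
margin = min(distances)
support_vectors = [p for p, d in zip(points, distances) if math.isclose(d, margin)]

print(round(margin, 4), support_vectors)
```

An SVM solver searches over w and b for the boundary that makes this margin as large as possible while keeping the two classes on opposite sides.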

Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used
two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated
as:
z = x² + y²
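
The effect of the added dimension can be checked numerically. In the sketch below, the inner and outer point sets are hypothetical data chosen so that no straight line in the (x, y) plane separates them, while the single threshold z = 1 does separate them after lifting; the function name lift is also an assumption.

```python
# Hypothetical 2-D points: an inner cluster around the origin and an outer ring.
inner = [(0.2, 0.1), (-0.3, 0.4), (0.5, -0.2), (-0.1, -0.6)]
outer = [(1.5, 0.3), (-1.2, 1.1), (0.4, -1.8), (-1.6, -0.7)]

def lift(point):
    """Add the third dimension z = x^2 + y^2 to a 2-D point."""
    x, y = point
    return (x, y, x * x + y * y)

# In the lifted space, the plane z = 1 separates the two groups.
threshold = 1.0
inner_below = all(lift(p)[2] < threshold for p in inner)
outer_above = all(lift(p)[2] > threshold for p in outer)

print(inner_below, outer_above)
```

Projected back to two dimensions, the plane z = 1 corresponds to the circle x² + y² = 1, which is a non-linear boundary in the original space.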
By adding the third dimension, the sample space will become as in the below image:

So now, SVM will divide the datasets into classes in the following
way. Since we are in 3-D space, the boundary looks like a plane
parallel to the x-axis. If we convert it to 2-D space with z = 1, then it
will become:

5. Explain in detail about decision tree and random forest?


Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent the
decision rules and each leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.

o Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are
the output of those decisions and do not contain any further branches.

o The decisions or the test are performed on the basis of features of the given dataset. It is a
graphical representation for getting all the possible solutions to a problem/decision based on given
conditions.

o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure. In order to build a tree, we use the CART
algorithm, which stands for Classification and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree
into subtrees. The below diagram explains the general structure of a decision tree:

There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model. Below
are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies


 Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the unwanted branches from the tree.
 Parent/Child node: A node that is split into sub-nodes is called the parent node of those sub-nodes,
and the sub-nodes are called its child nodes; the root node is the parent of the nodes directly below it.

How does the Decision Tree algorithm Work?


In a decision tree, to predict the class of a given record, the algorithm starts from the
root node of the tree. It compares the value of the root attribute with the corresponding attribute
of the record and, based on the comparison, follows the matching branch and jumps to the next node.
At the next node, the algorithm again compares the record's attribute value with the node's condition
and moves further down. It continues the process until it reaches a leaf node of the tree. The
complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets, one for each possible value of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively build new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further;
such a final node is called a leaf node.
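
The attribute selection in Step-2 can be sketched in a few lines. This is a minimal sketch, assuming Gini impurity (the measure used by CART) as the ASM; the toy rows and the attribute names "salary", "distance" and "accept" are hypothetical and only serve as illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_reduction(rows, attribute, target):
    """Impurity of the target before the split minus the weighted impurity after it."""
    before = gini([r[target] for r in rows])
    after = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        after += len(subset) / len(rows) * gini(subset)
    return before - after

# Hypothetical toy data (not from the notes).
rows = [
    {"salary": "high", "distance": "near", "accept": "yes"},
    {"salary": "high", "distance": "far",  "accept": "yes"},
    {"salary": "high", "distance": "near", "accept": "yes"},
    {"salary": "low",  "distance": "near", "accept": "no"},
    {"salary": "low",  "distance": "far",  "accept": "no"},
    {"salary": "low",  "distance": "far",  "accept": "no"},
]

reductions = {a: gini_reduction(rows, a, "accept") for a in ("salary", "distance")}
best_attribute = max(reductions, key=reductions.get)  # attribute placed at the root
print(reductions, best_attribute)
```

In this toy data, splitting on salary separates the two classes completely, so it gives the larger impurity reduction and would be placed at the root; Steps 3 to 5 then repeat the same selection on each subset.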

Example: Suppose a candidate has a job offer and wants to decide whether to accept the offer or
not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen
by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node, based on the corresponding labels. The next decision node further splits into
one decision node (Cab facility) and one leaf node. Finally, the last decision node splits into two
leaf nodes (Accepted offer and Declined offer). Consider the below diagram:

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process that a human follows while making
any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o It requires less data cleaning than many other algorithms.

Disadvantages of the Decision Tree


o A decision tree can contain many layers, which makes it complex.
o It may overfit the training data; this can be mitigated by pruning or by using the Random Forest
algorithm.
o As the number of class labels grows, the computational complexity of the decision tree may increase.

UNIT – 4 (8 Marks and 16 Marks)


1. Explain the various ensemble learning techniques?
Ensemble methods are techniques that aim to improve the accuracy of results by
combining multiple models instead of using a single model. The combined model is often more
accurate than any of its individual members. This has boosted the popularity of ensemble methods
in machine learning.
Categories of Ensemble Methods

Ensemble methods fall into two broad categories, i.e., sequential ensemble techniques and
parallel ensemble techniques. Sequential ensemble techniques generate base learners in a sequence,
e.g., Adaptive Boosting (AdaBoost). The sequential generation of base learners exploits the
dependence between the base learners. The performance of the model is then improved by assigning
higher weights to the examples that previous learners misclassified.


 In parallel ensemble techniques, base learners are generated in parallel, e.g., random forest.
Parallel methods use the parallel generation of base learners to encourage independence
between the base learners. Averaging the predictions of independent base learners
significantly reduces the error.

 The majority of ensemble techniques apply a single algorithm in base learning, which results
in homogeneity in all base learners. Homogeneous base learners are base learners of the
same type, with similar qualities. Other methods apply heterogeneous base learners, giving
rise to heterogeneous ensembles. Heterogeneous base learners are learners of distinct types.

Main Types of Ensemble Methods

1. Bagging

 Bagging, short for bootstrap aggregating, is mainly applied in classification
and regression. It improves the accuracy of models, typically decision trees, by reducing
variance to a large extent. The reduction of variance improves accuracy and reduces
overfitting, which is a challenge for many predictive models.
 Bagging consists of two steps, i.e., bootstrapping and aggregation. Bootstrapping is a
sampling technique where samples are drawn from the whole population (set) with
replacement. Sampling with replacement helps make the selection
procedure randomized. The base learning algorithm is then run on each sample to complete the
procedure.
 Aggregation in bagging combines the predictions of all the base learners into a single
outcome. Without aggregation, predictions will be less accurate because not all
outcomes are taken into consideration. The aggregation is therefore based either on the predicted
probabilities of the bootstrapped models or on all outcomes of the predictive models (for example,
by majority vote or averaging).

Bagging is advantageous since weak base learners are combined to form a single strong learner
that is more stable than single learners. It also reduces variance, thereby reducing the
overfitting of models. One limitation of bagging is that it is computationally expensive. It can also
lead to more bias in models when the proper procedure of bagging is ignored.
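
The two steps of bagging can be sketched as follows. This is a minimal sketch, assuming a one-dimensional threshold classifier (a decision stump) as the base learner and majority voting as the aggregation rule; the data and all function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a 1-D threshold classifier that predicts 1 when x > threshold."""
    best_t, best_err = None, None
    for t, _ in sample:  # try every sampled x value as a candidate threshold
        err = sum((x > t) != bool(y) for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(data, n_learners, rng):
    """Bootstrapping: train each stump on a sample drawn with replacement."""
    return [train_stump(rng.choices(data, k=len(data))) for _ in range(n_learners)]

def bagging_predict(thresholds, x):
    """Aggregation: combine the stumps' predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (feature, class label).
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
rng = random.Random(0)
ensemble = bagging_fit(data, n_learners=25, rng=rng)
predictions = [bagging_predict(ensemble, x) for x in (0.5, 9.5)]
print(predictions)
```

An odd number of learners is used so that a two-class vote cannot end in a tie.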

2. Boosting

 Boosting is an ensemble technique that learns from previous predictor mistakes to make better
predictions in the future. The technique combines several weak base learners to form one
strong learner, thus significantly improving the predictability of models. Boosting works by
arranging weak learners in a sequence, such that each weak learner learns from the mistakes of the
previous learner in the sequence to create better predictive models.
 Boosting takes many forms, including gradient boosting, Adaptive Boosting (AdaBoost), and
XGBoost (Extreme Gradient Boosting). AdaBoost uses weak learners in the form of decision
trees, mostly with a single split, which are popularly known as decision stumps. AdaBoost's
first decision stump is trained on observations that all carry equal weights.

 Gradient boosting adds predictors sequentially to the ensemble, where each new predictor
corrects its predecessors, thereby increasing the model's accuracy. New predictors are fit to
counter the effects of errors in the previous predictors. The gradient of the loss function helps the
gradient booster identify problems in learners' predictions and counter them accordingly.
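
The sequential error correction can be sketched for a regression task. This is a minimal sketch, assuming squared-error loss (so that the negative gradient is simply the residual y minus the current prediction) and regression stumps as weak learners; the data, the learning rate and the function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error on the residuals."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not right:  # a split must leave points on both sides
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds, learning_rate):
    base = sum(ys) / len(ys)            # initial prediction: the mean target
    preds = [base] * len(xs)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the ensemble so far
        stump = fit_stump(xs, residuals)                # new predictor fits those errors
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history

def boosted_predict(base, stumps, learning_rate, x):
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Hypothetical toy regression data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
base, stumps, mse_history = gradient_boost(xs, ys, n_rounds=10, learning_rate=0.5)
print(mse_history[0], mse_history[-1])
```

Each round shrinks the training error a little further, which is the sense in which later predictors correct earlier ones.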

3. Stacking

Stacking, another ensemble method, is often referred to as stacked generalization. This technique
works by training a model (a meta-learner) to combine the predictions of several other learning
algorithms. Stacking has been successfully applied in regression, density estimation, distance
learning, and classification. It can also be used to measure the error rate involved during bagging.

Variance Reduction
Ensemble methods are well suited to reducing the variance in models, thereby increasing the accuracy
of predictions. The variance is reduced when multiple models are combined to form a single
prediction that is derived from all the predictions of the combined models. An ensemble
of models combines various models so that the resulting prediction draws on the consideration
of all predictions.

Simple Ensemble Techniques


In this section, we will look at a few simple but powerful techniques, namely:

1. Max Voting
2. Averaging
3. Weighted Averaging
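
The three techniques listed above can be sketched directly. This is a minimal sketch; the model outputs and the weights are made-up values chosen only for illustration.

```python
from collections import Counter

def max_voting(predictions):
    """Return the class predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

def averaging(predictions):
    """Return the plain mean of the models' numeric predictions."""
    return sum(predictions) / len(predictions)

def weighted_averaging(predictions, weights):
    """Return the mean where each model's prediction counts in proportion to its weight."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

# Hypothetical outputs of three models.
class_votes = ["cat", "dog", "cat"]
scores = [4.0, 5.0, 3.0]
weights = [0.5, 0.3, 0.2]   # assumed importance of each model

voted = max_voting(class_votes)
mean_score = averaging(scores)
weighted_score = weighted_averaging(scores, weights)
print(voted, mean_score, weighted_score)
```

Max voting is normally used for classification, while averaging and weighted averaging apply to numeric predictions or predicted probabilities.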

2. Explain in detail about k-means algorithm?


K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering
problems in machine learning or data science. This topic covers what the K-means clustering
algorithm is and how the algorithm works.
What is K-Means Algorithm?
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process: if K=2, there will be two clusters, for K=3, there will be three clusters, and so on.
 It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group that has similar properties.

 It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in the unlabeled dataset on its own, without the need for any labeled training data.


 It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of (squared) distances between the data points and
the centroids of their corresponding clusters.

 The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until the cluster assignments no longer change. The value of k
should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best positions for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are nearest to a
particular k-center form a cluster.

Hence each cluster has data points with some commonalities and is separated from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the mean of the data points in each cluster and place the new centroid of that
cluster there.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid.
Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.
Step-7: The model is ready.
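
The steps above can be sketched as follows. This is a minimal sketch for two-dimensional points, assuming Euclidean distance and starting centroids that are passed in explicitly (rather than drawn at random) so that the run is reproducible; the data and the function name are hypothetical.

```python
from math import dist

def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment (Step-3/5) and centroid update (Step-4) until stable (Step-6)."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the index of its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:  # no reassignment occurred: finish
            break
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Hypothetical toy data with two visible groups; K = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centroids, assignment = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids, assignment)
```

The starting centroids here are deliberately not part of the dataset, which Step-2 allows.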
Let's understand the above steps by considering the visual plots:
o Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below. Let's take the number of clusters as K=2, so that we group the data points
into two different clusters.
o We need to choose some random K points or centroids to form the clusters. These points can be
either points from the dataset or any other points. So, here we are selecting the below two
points as K points, which are not part of our dataset. Consider the below image:

Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying the mathematics that we have studied for calculating the distance between two
points. To do so, we draw the perpendicular bisector of the line joining the two centroids. Consider
the below image:


From the above image, it is clear that the points on the left side of the line are nearer to the K1 or
blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color
them blue and yellow for clear visualization. As we need to find the closest cluster, we will repeat
the process by choosing new centroids. To choose the new centroids, we will compute the center of
gravity (mean) of the data points in each cluster and place the new centroids there. Next, we will
reassign each data point to its nearest new centroid. For this, we will repeat the same process of
drawing the dividing line.

How to choose the value of "K number of clusters" in K-means Clustering?


The performance of the K-means clustering algorithm depends on how well-formed the clusters are
that it produces. Choosing the optimal number of clusters is therefore an important task. There are
several ways to find the optimal number of clusters, but here we are discussing the most appropriate
method to find the number of clusters or value of K. The method is given below:
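
The notes do not show the method at this point. One widely used choice, assumed here, is the elbow method: run K-means for several values of K, compute the within-cluster sum of squares (WCSS) each time, and pick the K after which WCSS stops dropping sharply. The sketch below uses one-dimensional toy data and a deterministic starting rule, both hypothetical.

```python
def k_means_1d(values, k, max_iter=100):
    """Plain K-means on 1-D data with a deterministic, evenly spread start."""
    ordered = sorted(values)
    n = len(ordered)
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids, clusters

def wcss(centroids, clusters):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    return sum((v - c) ** 2 for c, members in zip(centroids, clusters) for v in members)

# Hypothetical data with three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
wcss_by_k = {k: wcss(*k_means_1d(values, k)) for k in range(1, 5)}
for k, score in wcss_by_k.items():
    print(k, round(score, 2))
```

For this toy data the WCSS falls steeply up to K=3 and only slightly afterwards, so the "elbow" of the curve suggests K=3.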

3. Explain details about KNN algorithm?


K-Nearest Neighbor (KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts
the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a well-suited
category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used
for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and performs its computation at the time of
classification.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.

o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we
want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will compare the features
of the new image with those of the cat and dog images and, based on the most similar features,
will put it in either the cat or the dog category.


Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1. In which of these categories does this data point lie? To solve this type of problem, we
need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular data point. Consider the below diagram:

How does K-NN work?

The working of K-NN can be explained on the basis of the below algorithm:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to each point in the training data.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the below
image:


o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. For two
points (x1, y1) and (x2, y2), it can be calculated as:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors
in category A and two nearest neighbors in category B. Consider the below image:


o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
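
The walk-through above can be sketched as follows. This is a minimal sketch, assuming two-dimensional points and Euclidean distance; the coordinates are made up so that, with k=5, three of the nearest neighbors fall in category A and two in category B, as in the example.

```python
from collections import Counter
from math import dist

def knn_classify(training, query, k):
    """training: list of ((x, y), label). Return (majority label, vote counts) of the k nearest."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], votes

# Hypothetical labelled points (not from the notes).
training = [
    ((2.0, 3.0), "A"), ((2.0, 2.0), "A"), ((3.0, 1.5), "A"), ((0.0, 0.0), "A"),
    ((4.0, 4.5), "B"), ((5.0, 3.0), "B"), ((6.0, 5.0), "B"), ((7.0, 7.0), "B"),
]
new_point = (3.0, 3.0)

label, votes = knn_classify(training, new_point, k=5)
print(label, dict(votes))
```

Note that no training step appears in the code: the data is only stored, and all the work happens when the new point is classified.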

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values
to find the best among them. A commonly used value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive to
outliers.
o Large values for K reduce the effect of noise, but they can blur the boundaries between
categories and increase the computation.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data, provided K is chosen large enough.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high because the distance between the new data point and
all the training samples must be calculated.


4. Explain in detail about Gaussian mixture models and expectation maximization?

EM algorithm in GMM
In statistics, the EM (expectation-maximization) algorithm handles latent variables, while GMM is the
Gaussian mixture model.

 Gaussian mixture models (GMMs) are a type of probabilistic machine learning model. They are used to
group data into different categories based on the probability distribution of the data. Gaussian
mixture models can be used in many different areas, including finance, marketing and many more.

 Gaussian Mixture Models (GMMs) give us more flexibility than K-Means. With GMMs we
assume that the data points in each cluster are Gaussian distributed; this is a less restrictive
assumption than saying the clusters are circular, which is what K-Means implies by using only the
mean. That way, we have two parameters to describe the shape of each cluster: the mean and the
standard deviation.

 Taking an example in two dimensions, this means that the clusters can take any kind of
elliptical shape (since we have a standard deviation in both the x and y directions). Thus, each
cluster is modeled by a single Gaussian distribution. In order to find the parameters of the
Gaussian for each cluster (e.g., the mean and standard deviation) we will use an optimization
algorithm called Expectation–Maximization (EM). Take a look at the graphic below as an
illustration of the Gaussians being fitted to the clusters. Then we can proceed to the process
of Expectation–Maximization clustering using GMMs.

 Gaussian mixture models (GMM) are a probabilistic concept used to model real-world data
sets. GMMs are a generalization of Gaussian distributions and can be used to represent any
data set that can be clustered into multiple Gaussian distributions. The Gaussian mixture
model is a probabilistic model that assumes all the data points are generated from a mix of
Gaussian distributions with unknown parameters.

 A Gaussian mixture model can be used for clustering, which is the task of grouping a set of
data points into clusters. GMMs can be used to find clusters in data sets where the clusters
may not be clearly defined. Additionally, GMMs can be used to estimate the probability that a
new data point belongs to each cluster. Because these assignments are soft, a data point that does
not fit neatly into any of the clusters is given a probability for each cluster rather than being
forced into one, although extreme outliers can still distort the estimated parameters. This makes
GMMs a flexible and powerful tool for clustering data.

 It can be understood as a probabilistic model where a Gaussian distribution is assumed for
each group, with a mean and a covariance as its parameters. A GMM is described by the mean
vectors (μ) and covariance matrices (Σ) of its components, together with a mixing weight for each
component. A Gaussian distribution is defined as a continuous probability distribution that takes on
a bell-shaped curve. Another name for the Gaussian distribution is the normal distribution. Here is
a picture of Gaussian mixture models:

 GMM has many applications, such as density estimation, clustering, and image segmentation.
For density estimation, GMM can be used to estimate the probability density function of a set
of data points. For clustering, GMM can be used to group together data points that come from
the same Gaussian distribution. And for image segmentation, GMM can be used to partition
an image into different regions.

 Gaussian mixture models can be used for a variety of use cases, including identifying
customer segments, detecting fraudulent activity, and clustering images. In each of these
examples, the Gaussian mixture model is able to identify clusters in the data that may not be
immediately obvious. As a result, Gaussian mixture models are a powerful tool for data
analysis and are worth considering for clustering tasks.

Expectation-maximization (EM) method in relation to GMM


In Gaussian mixture models, the expectation-maximization method is a powerful tool for
estimating the parameters of the model. The expectation step is termed E and the
maximization step is termed M. The E-step uses the current Gaussian parameters to compute, for each
data point, the probability that it was generated by each component of the mixture. The M-step uses
these probabilities to re-estimate the parameters of each component so that the likelihood of the
data is maximized.

 The expectation-maximization method is a two-step iterative algorithm that alternates
between an expectation step, in which we compute expectations for each data
point using the current parameter estimates, and a maximization step, in which we update the
Gaussian parameters (such as the means) using the maximum likelihood estimate given those
expectations.

 The EM method works by first initializing the parameters of the GMM, then iteratively
improving these estimates. At each iteration, the expectation step calculates the expectation of
the log-likelihood function with respect to the current parameters. This expectation is then
used to maximize the likelihood in the maximization step. The process is then repeated until
convergence. Here is a picture representing the two-step iterative aspect of the algorithm

The EM algorithm consists of two steps: the E-step and the M-step. First, the model parameters
can be randomly initialized. In the E-step, the algorithm tries to guess the value of the latent
variables z based on the current parameters, while in the M-step, the algorithm updates the value of
the model parameters based on the guess from the E-step. These two steps are repeated until
convergence is reached.

Optimization uses the Expectation Maximization algorithm, which alternates between two
steps:
1. E-step: Compute the posterior probability over z given our current model - i.e. how much
do we think each Gaussian generates each datapoint.


2. M-step: Assuming that the data really was generated this way, change the parameters of
each Gaussian to maximize the probability that it would generate the data it is currently
responsible for.

The K-Means Algorithm:


1. Assignment step: Assign each data point to the closest cluster
2. Refitting step: Move each cluster center to the center of gravity of the data assigned to it

The EM Algorithm:
1. E-step: Compute the posterior probability over z given our current model
2. M-step: Change the parameters of each Gaussian to maximize the probability that it would generate
the data it is currently responsible for.
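
The E-step and M-step can be sketched for a mixture of one-dimensional Gaussians. This is a minimal sketch, assuming fixed (not random) starting parameters and a fixed number of iterations instead of a convergence test; the data and the function names are hypothetical.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

def em_gmm_1d(data, means, stds, weights, n_iter=50):
    means, stds, weights = list(means), list(stds), list(weights)
    k = len(means)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) of each component for each point.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], stds[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate each Gaussian from the data it is responsible for.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(sqrt(var), 1e-6)  # small floor so a component cannot collapse to zero width
            weights[j] = nj / len(data)
    return means, stds, weights, resp

# Hypothetical data drawn around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, stds, weights, resp = em_gmm_1d(data, means=[0.0, 6.0], stds=[1.0, 1.0], weights=[0.5, 0.5])
print(means, stds, weights)
```

Unlike the hard assignment step of K-means, every point here contributes to every Gaussian, weighted by its responsibility.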

UNIT – 5 (8 Marks and 16 Marks)


1. Explain in detail about Perceptrons and its types?
The Perceptron is a Machine Learning algorithm for supervised learning of binary
classification tasks. Further, a Perceptron is also understood as an artificial neuron, or neural
network unit, that computes an output from its input data. The Perceptron model is
also regarded as one of the simplest types of Artificial Neural Networks. Being a
supervised learning algorithm for binary classifiers, it can be considered a single-layer neural
network with four main parameters, i.e., input values, weights and bias, net sum, and an activation
function.

Basic Components of Perceptron


Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which contains three
main components. These are as follows:


o Input Nodes or Input Layer:


This is the primary component of Perceptron which accepts the initial data into the system for further
processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units. This is another
important parameter of the Perceptron. The weight determines how strongly the
associated input neuron influences the output. Further, the bias can be considered as the intercept
term in a linear equation.
o Activation Function:
This is the final important component, which helps to determine whether the neuron will fire or
not. The activation function of a basic perceptron is primarily a step function.

Types of Activation functions:


o Sign function
o Step function, and
o Sigmoid function

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These are as follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model

Single Layer Perceptron Model:


 This is one of the simplest types of Artificial Neural Network (ANN). A single-layer perceptron
model consists of a feed-forward network and includes a threshold transfer function inside the model.
The main objective of the single-layer perceptron model is to classify linearly separable objects
with binary outcomes.

 A single-layer perceptron has no prior knowledge stored, so it begins with
randomly assigned values for the weight parameters. It then computes the weighted sum of all inputs.
If this sum is more than a pre-determined threshold value, the model gets activated
and shows the output value as +1.

 If the outcome matches the desired (target) value, then the performance of the model is
considered satisfactory and the weights do not change. However, discrepancies can arise
when multiple input values are fed into the model. Hence, to obtain the
desired output and minimize errors, the weights need to be adjusted.
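
The weight adjustment can be sketched with the classic perceptron learning rule, in which each weight is moved by (learning rate × error × input). This is a minimal sketch, assuming zero starting weights (instead of random ones, so the run is reproducible), outputs of 1 and 0, and the AND gate as hypothetical training data.

```python
def predict(weights, bias, x):
    """Step activation on the weighted sum: 1 if w.x + b > 0, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, learning_rate=1.0, epochs=20):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # 0 when the output is already correct
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

# AND gate: linearly separable, so the rule is guaranteed to converge.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
outputs = [predict(weights, bias, x) for x, _ in and_gate]
print(weights, bias, outputs)
```

When the prediction is correct the error is zero and nothing changes, which matches the statement that the weights are left alone once the output is as desired.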

Multi-Layered Perceptron Model:


A multi-layer perceptron model also has the same model structure but has a greater number of
hidden layers.


The multi-layer perceptron model is trained using the Backpropagation algorithm, which executes in
two stages as follows:
o Forward Stage: In the forward stage, activations are computed starting from the input layer and
ending at the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's
requirement. In this stage, the error between the actual output and the desired output is propagated
backward, starting at the output layer and ending at the input layer.

Instead of a linear or step function, the activation function can be sigmoid, TanH, ReLU, etc.
A multi-layer perceptron model has greater processing power and can process linear and non-linear
patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR,
NOR.
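
XOR is the usual example of a gate that a single-layer perceptron cannot represent, because its classes are not linearly separable, while one hidden layer is enough. This is a minimal sketch with hand-chosen weights (an assumption for illustration; they are not learned by backpropagation here) and a step activation.

```python
def step(value):
    return 1 if value > 0 else 0

def xor_network(x1, x2):
    h_or = step(x1 + x2 - 0.5)       # hidden unit 1 behaves like OR
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2 behaves like AND
    return step(h_or - h_and - 0.5)  # output fires when OR is true but AND is not

truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
print(truth_table)
```

Each hidden unit is itself a single perceptron; the non-linear pattern only becomes representable once their outputs are combined by a further layer.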

Advantages of Multi-Layer Perceptron:


o A multi-layered perceptron model can be used to solve complex non-linear problems.
o It works well with both small and large input data.
o It helps us to obtain quick predictions after the training.
o It can achieve comparable accuracy with large as well as small data.
Disadvantages of Multi-Layer Perceptron:
o In a multi-layer perceptron, computations are difficult and time-consuming.
o In a multi-layer perceptron, it is difficult to determine how much each independent variable
affects the dependent variable.
o The model's performance depends on the quality of the training.

Perceptron Function
The perceptron function f(x) is obtained by multiplying the input 'x' with the learned
weight coefficient 'w', adding the bias 'b', and applying a step function.
Mathematically, we can express it as follows:
f(x) = 1 if w.x + b > 0; otherwise, f(x) = 0
'w' represents the real-valued weight vector
'b' represents the bias
'x' represents the vector of input values.

Characteristics of Perceptron
The perceptron model has the following characteristics.
1. Perceptron is a machine learning algorithm for supervised learning of binary classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, weights are multiplied with input features, and the decision is made whether the
neuron is fired or not.
4. The activation function applies a step rule to check whether the weighted sum is greater than
zero.
5. The linear decision boundary is drawn, enabling the distinction between the two linearly
separable classes +1 and -1.
6. If the added sum of all input values is more than the threshold value, it must have an output
signal; otherwise, no output will be shown.


2. Explain about ReLu, Hyperparameter tuning, Normalization, Regularization, Dropout?

i) Hyperparameter Tuning in Deep Learning


The first hyperparameter to tune is the number of neurons in each hidden layer. In this
case, the number of neurons in every layer is set to be the same, although it can also be made
different per layer. The number of neurons should be adjusted to the complexity of the problem: a
task that is more complex to predict needs more neurons. Here, the number of neurons is searched in
the range from 10 to 100.
 An activation function is a parameter in each layer. Input data are fed to the input layer,
followed by hidden layers, and the final output layer. The output layer contains the output
value. The input values moving from a layer to another layer keep changing according to the
activation function.

 The activation function decides how to compute the input values of a layer into output values.
The output values of a layer are then passed to the next layer as input values again. The next
layer then computes the values into output values for another layer again. There are 9
activation functions to choose from in this demonstration. Each activation function has its own
formula (and graph) for computing its output; these are not discussed here.

 The layers of a neural network are compiled and an optimizer is assigned. The optimizer is
responsible for updating the weights of the neurons in the neural network (and, in some optimizers,
adapting the learning rate) to minimize the loss function. The optimizer is very important for
achieving the highest possible accuracy or the minimum loss. There are 7 optimizers to choose from.
Each has a different concept behind it.

 One of the hyperparameters of the optimizer is the learning rate, which we will also tune.
The learning rate controls the step size the model takes toward the minimum of the loss function. A
higher learning rate makes the model learn faster, but it may overshoot the minimum
and only reach its surroundings. A lower learning rate gives a better chance of finding a
minimum of the loss function. As a tradeoff, a lower learning rate needs more epochs, and therefore
more time and memory resources.
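
The learning-rate tradeoff can be seen on a one-parameter problem. This is a minimal sketch, assuming plain gradient descent on the hypothetical loss (w - 3)^2, whose minimum is at w = 3; the three learning rates are arbitrary illustrative choices.

```python
def gradient_descent(learning_rate, steps=25, start=0.0):
    """Minimize loss(w) = (w - 3)^2 with a fixed step size and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3.0)   # derivative of the loss with respect to w
        w -= learning_rate * gradient
    return w

results = {lr: gradient_descent(lr) for lr in (0.01, 0.1, 1.1)}
for lr, w in results.items():
    print(lr, round(w, 3))
```

With the same number of steps, the smallest rate is still far from the minimum, the middle rate gets close to it, and the largest rate overshoots more on every step and diverges.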

ii) ReLu - Rectified Linear Unit


A Rectified Linear Unit is a form of activation function used commonly in deep learning models.
In essence, the function returns 0 if it receives a negative input, and if it receives a positive value, the
function returns the same positive value. The function is defined as:
f(x)=max(0,x)
The rectified linear unit, or ReLU, allows the deep learning model to account for non-linearities
and specific interaction effects. In the graph of the ReLU function, any negative X input results in
an output of 0, and only once positive values are entered does the function begin to slope upward.
How does a Rectified Linear Unit work?
To understand how a ReLU works, it is important to understand the effect it has on variable
interactions. An interaction effect occurs when the way a variable affects a prediction depends on the
value of associated variables. For example, when comparing IQ scores of two different schools, IQ and
age may interact: a high school student typically scores higher than an elementary school student,
because age and IQ interact with each other regardless of the school. This is known as an interaction
effect, and ReLUs give a network a way to model such effects. For example, if A=1 and B=2 have the
respective associated weights of 2 and 3, the function would be f(2A+3B). If A increases, the output
will increase as well. However, if B is a large negative value, 2A+3B becomes negative and the output
will be 0, so the effect of A on the output depends on the value of B.
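
The f(2A+3B) example above can be sketched in a few lines of Python. This is only an illustration: the function names `relu` and `neuron` are hypothetical, and the weights 2 and 3 are the ones used in the text.

```python
def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x)."""
    return max(0.0, x)


def neuron(a, b, w_a=2.0, w_b=3.0):
    """A single ReLU neuron computing f(2A + 3B) with the weights from the text."""
    return relu(w_a * a + w_b * b)


print(neuron(1, 2))     # 2*1 + 3*2 = 8 is positive, so it passes through: 8.0
print(neuron(2, 2))     # increasing A increases the output: 10.0
print(neuron(1, -100))  # a large negative B makes the sum negative: 0.0
```

Once B is negative enough, changing A no longer changes the output, which is the interaction between A and B described above.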

Benefits of using the ReLU function

Its simplicity makes it a relatively cheap function to compute. As there is no complicated math, the
model can be trained and run in a relatively short time. Similarly, it tends to converge faster, because
the slope doesn't plateau as the value for X gets larger. The vanishing gradient problem is therefore
avoided for positive inputs in ReLU, unlike alternative functions such as sigmoid or tanh. Lastly, ReLU
is sparsely activated because for all negative inputs, the output is zero. Sparsity means that a given
neuron is activated only in specific situations. This is a desirable feature for modern neural networks,
as in a sparse network it is more likely that neurons are appropriately processing valuable parts of a
problem. For example, a model that is processing images of fish may contain a neuron that is
specialized to identify fish eyes. That specific neuron would not be activated if the model was
processing images of airplanes instead. This specialized use of neurons accounts for network sparsity.

iii) Regularization :
 Regularization is one of the most important concepts of machine learning. It is a technique to
prevent the model from overfitting by adding extra information to it. Sometimes a machine
learning model performs well with the training data but does not perform well with the test data.
This means the model has also fitted the noise in the training data and is not able to predict the
output well when it deals with unseen data; such a model is called overfitted. This problem can be
dealt with using a regularization technique.

 This technique allows us to keep all variables or features in the model while reducing the
magnitude of their coefficients. Hence, it maintains accuracy as well as the generalization of the
model. It mainly regularizes, or shrinks, the coefficients of the features toward zero.
In simple words, "In regularization technique, we reduce the magnitude of the features by keeping the
same number of features."

How does Regularization Work?


Regularization works by adding a penalty or complexity term to the cost function of a complex model.
Let's consider the simple linear regression equation:

y= β0+β1x1+β2x2+β3x3+⋯+βnxn +b
In the above equation, y represents the value to be predicted, and x1, x2, …, xn are the features for
y. β1, …, βn are the weights or magnitudes attached to the features, respectively, while the constant
terms β0 and b represent the bias (intercept) of the model. Linear regression models try to optimize
the coefficients β and b to minimize the cost function. The loss function for linear regression is
called RSS, or residual sum of squares. For M training samples with true values y_i and predicted
values ŷ_i it is:

$RSS = \sum_{i=1}^{M} (y_i - \hat{y}_i)^2$

We optimize the parameters against this loss function so that the model can predict the value of y
accurately.

Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression

Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is
introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the
model. It is also called as L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of
bias added to the model is called the Ridge Regression penalty. We can calculate it by
multiplying λ (lambda) by the sum of the squared weights of the individual features.
o The equation for the cost function in ridge regression (for M training samples and n features)
will be:

$\sum_{i=1}^{M} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$

o In the above equation, the penalty term regularizes the coefficients of the model, and hence
ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of
the model.
o As we can see from the above equation, if the value of λ tends to zero, the equation becomes
the cost function of the linear regression model. Hence, for very small values of λ, the
model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.

Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It
stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to Ridge Regression except that the penalty term contains the absolute values of
the weights instead of the squares of the weights.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas Ridge Regression
can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression
(for M training samples and n features) will be:

$\sum_{i=1}^{M} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} |\beta_j|$

o Some of the features in this technique are completely neglected by the model, because their
coefficients become exactly 0.
o Hence, Lasso regression can help us to reduce overfitting in the model and also to perform
feature selection.

Key Difference between Ridge Regression and Lasso Regression


o Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the
features present in the model. It reduces the complexity of the model by shrinking the
coefficients.
o Lasso regression helps to reduce the overfitting in the model and also performs feature selection,
because it can shrink some coefficients exactly to zero.
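
The difference between the two penalties can be seen in a small sketch. It makes simplifying assumptions that are not in the notes: a single feature, no intercept, and the cost functions RSS + λβ² (ridge) and RSS + λ|β| (lasso), for which the minimizing slope has a closed form. The function names `ridge_slope` and `lasso_slope` are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Slope b minimizing sum((y - b*x)^2) + lam * b^2 (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_slope(xs, ys, lam):
    """Slope b minimizing sum((y - b*x)^2) + lam * |b| (soft-thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # magnitude after the L1 shrinkage
    return (shrunk if sxy >= 0 else -shrunk) / sxx


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy data lying exactly on y = 2x

for lam in (0.0, 14.0, 28.0, 56.0):
    print(lam, ridge_slope(xs, ys, lam), lasso_slope(xs, ys, lam))
```

With λ = 0 both return the ordinary least-squares slope 2. As λ grows, the ridge slope keeps shrinking but stays positive, while the lasso slope reaches exactly 0 at λ = 56 for this data.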

3.Explain about Error Backpropagation?


Backpropagation, or backward propagation of errors, is an algorithm that is designed to test for
errors working back from output nodes to input nodes. It is an important mathematical tool for
improving the accuracy of predictions in data mining and machine learning. Essentially,
backpropagation is an algorithm used to calculate derivatives quickly.

There are two leading types of backpropagation networks:


1. Static backpropagation:
Static backpropagation is a network developed to map static inputs to static outputs. Static
backpropagation networks can solve static classification problems, such as optical character
recognition (OCR).

2. Recurrent backpropagation:
The recurrent backpropagation network is used for fixed-point learning. In recurrent
backpropagation, the activation is fed forward until it reaches a fixed value.

What is a backpropagation algorithm in a neural network?


Artificial neural networks use backpropagation as a learning algorithm to compute the gradient of the
error with respect to the weight values, which gradient descent then uses to update the weights. By
comparing desired outputs to achieved system outputs, the systems are tuned by adjusting connection
weights to narrow the difference between the two as much as possible. The algorithm gets its name
because the weights are updated backward, from output to input.

Advantages
 It does not have any parameters to tune except for the number of inputs.
 It is highly adaptable and efficient and does not require any prior knowledge about the
network.
 It is a standard process that usually works well.
 It is user-friendly, fast and easy to program.
 Users do not need to learn any special functions.

Disadvantages
 It prefers a matrix-based approach over a mini-batch approach.
 Data mining is sensitive to noise and irregularities.
 Performance is highly dependent on input data.
 Training is time- and resource-intensive.

Features of Backpropagation:
1. It is the gradient descent method, as used in the case of a simple perceptron network with
differentiable units.
2. It differs from other networks in the process by which the weights are calculated during the
learning period of the network.
3. Training is done in three stages:
 the feed-forward of the input training pattern
 the calculation and backpropagation of the error
 the update of the weights

Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from the input vectors that the
network operates on. The network compares the generated output to the desired output and produces
an error signal if the two do not match. It then adjusts the weights according to this error to move
toward the desired output.

Backpropagation Algorithm:
Step 1: Inputs X arrive through the preconnected path.
Step 2: The input is modeled using real-valued weights W. The weights are usually chosen randomly.
Step 3: Calculate the output of each neuron from the input layer to the hidden layer to the output
layer.
Step 4: Calculate the error in the outputs:
Backpropagation Error = Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the weights to reduce the error.
Step 6: Repeat the process until the desired output is achieved.

Fig: Error backpropagation

 x = input training vector, $x = (x_1, x_2, \dots, x_n)$.
 t = target vector, $t = (t_1, t_2, \dots, t_m)$.
 δk = error at output unit k.
 δj = error at hidden unit j.
 α = learning rate.
 v0j = bias of hidden unit j.

Training Algorithm :
Step 1: Initialize weight to small random values.

Step 2: While the stopping condition is false, do steps 3 to 9.
Step 3: For each training pair, do steps 4 to 8 (Feed-Forward).
Step 4: Each input unit receives the input signal $x_i$ and transmits it to all the units in the
hidden layer.
Step 5: Each hidden unit $z_j$ (j = 1 to a) sums its weighted input signals to calculate its net input
$z_{in,j} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}$
It applies its activation function, $z_j = f(z_{in,j})$, and sends this signal to all units in the
layer above, i.e. the output units.
Each output unit $y_k$ (k = 1 to m) sums its weighted input signals
$y_{in,k} = w_{0k} + \sum_{j=1}^{a} z_j w_{jk}$
and applies its activation function to calculate the output signal
$y_k = f(y_{in,k})$

Backpropagation Error :
Step 6: Each output unit $y_k$ (k = 1 to m) receives a target pattern corresponding to the input
pattern, and its error term is calculated as:
$\delta_k = (t_k - y_k)\, f'(y_{in,k})$
where $f'$ is the derivative of the activation function $f$.
Step 7: Each hidden unit $z_j$ (j = 1 to a) sums the error terms from all units in the layer above
$\delta_{in,j} = \sum_{k=1}^{m} \delta_k w_{jk}$
The error information term is calculated as:
$\delta_j = \delta_{in,j}\, f'(z_{in,j})$

Updation of weight and bias :

Step 8: Each output unit $y_k$ (k = 1 to m) updates its bias and weights (j = 1 to a). The weight
correction term is given by:
$\Delta w_{jk} = \alpha\, \delta_k z_j$
and the bias correction term is given by $\Delta w_{0k} = \alpha\, \delta_k$.
Therefore
$w_{jk}(\text{new}) = w_{jk}(\text{old}) + \Delta w_{jk}$
$w_{0k}(\text{new}) = w_{0k}(\text{old}) + \Delta w_{0k}$
Each hidden unit $z_j$ (j = 1 to a) updates its bias and weights (i = 1 to n). The weight
correction term is
$\Delta v_{ij} = \alpha\, \delta_j x_i$
and the bias correction term is
$\Delta v_{0j} = \alpha\, \delta_j$
Therefore
$v_{ij}(\text{new}) = v_{ij}(\text{old}) + \Delta v_{ij}$
$v_{0j}(\text{new}) = v_{0j}(\text{old}) + \Delta v_{0j}$
Step 9: Test the stopping condition. The stopping condition can be a sufficiently small error or a
maximum number of epochs.
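
The training algorithm above can be sketched in plain Python. The sketch assumes details the notes do not fix: a sigmoid activation (so f'(net) = f(net)(1 - f(net))), one output unit, two hidden units, a fixed number of epochs as the stopping condition, and the logical OR function as toy training data. The names `train` and `OR_DATA` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(data, n_hidden=2, alpha=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Step 1: initialize weights and biases to small random values.
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    v0 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w0 = rng.uniform(-0.5, 0.5)

    def forward(x):
        # Steps 4-5: feed-forward through the hidden layer to the output unit.
        z = [sigmoid(v0[j] + sum(x[i] * v[i][j] for i in range(n_in)))
             for j in range(n_hidden)]
        y = sigmoid(w0 + sum(z[j] * w[j] for j in range(n_hidden)))
        return z, y

    def total_error():
        return sum((t - forward(x)[1]) ** 2 for x, t in data)

    history = [total_error()]
    for _ in range(epochs):              # Step 2 (stopping condition: epoch count)
        for x, t in data:                # Step 3
            z, y = forward(x)
            # Step 6: output error term, using f'(net) = y * (1 - y) for sigmoid.
            delta_k = (t - y) * y * (1.0 - y)
            # Step 7: hidden error terms, computed before any weight is changed.
            delta_j = [delta_k * w[j] * z[j] * (1.0 - z[j]) for j in range(n_hidden)]
            # Step 8: update weights and biases.
            for j in range(n_hidden):
                w[j] += alpha * delta_k * z[j]
                v0[j] += alpha * delta_j[j]
                for i in range(n_in):
                    v[i][j] += alpha * delta_j[j] * x[i]
            w0 += alpha * delta_k
        history.append(total_error())
    return forward, history


OR_DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
forward, history = train(OR_DATA)
print("error before:", history[0], "error after:", history[-1])
```

The total squared error is recorded after every epoch, so `history` can be inspected to see how the error falls as the weights are adjusted.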

4.Explain detail about activation functions?


ACTIVATION FUNCTIONS
The activation function decides whether a neuron should be activated or not by
calculating the weighted sum and further adding bias to it. The purpose of the activation function
is to introduce non-linearity into the output of a neuron.

 In a neural network, we would update the weights and biases of the neurons on the basis of
the error at the output. This process is known as back-propagation . Activation functions make
the back-propagation possible since the gradients are supplied along with the error to update
the weights and biases.

  A neural network without an activation function is essentially just a linear regression
model. The activation function applies a non-linear transformation to the input, making the network
capable of learning and performing more complex tasks.

Calculation at Output layer


z(2) = (W(2) * [W(1)X + b(1)]) + b(2)
z(2) = [W(2) * W(1)] * X + [W(2)*b(1) + b(2)]
Let,
[W(2) * W(1)] = W
[W(2)*b(1) + b(2)] = b
Final output : z(2) = W*X + b
which is again a linear function
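
The derivation above can be checked numerically. The sketch below uses small hand-picked 2x2 integer matrices (the values are arbitrary assumptions, and the helper names `matvec`, `matmul` and `add` are hypothetical) and compares the two-layer computation with the collapsed single layer W*X + b.

```python
def matvec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Matrix-matrix product."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]


W1, b1 = [[1, 2], [3, 4]], [1, -1]
W2, b2 = [[2, 0], [1, 1]], [0, 5]
x = [3, -2]

# Two stacked layers without an activation: W2 * (W1 * x + b1) + b2
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W = W2 * W1, b = W2 * b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layers, one_layer)  # both are [0, 5]
```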
Variants of Activation Function

Linear Function

  Equation : A linear function has the equation of a straight line, i.e. y = x.
  No matter how many layers we have, if all are linear in nature, the final activation of the
last layer is nothing but a linear function of the input of the first layer.
  Range : -inf to +inf
  Uses: The linear activation function is used in just one place, i.e. the output layer.
  Issues: If we differentiate a linear function, the result no longer depends on the input "x".
The derivative is a constant, so the function does not introduce any non-linearity or
ground-breaking behavior into our algorithm.

For example: Calculating the price of a house is a regression problem. A house price may take any
large or small value, so we can apply a linear activation at the output layer. Even in this case the
neural net must have some non-linear function in its hidden layers.

Sigmoid Function
 It is a function which is plotted as ‘S’ shaped graph.
  Equation : A = 1/(1 + e^(-x))

  Nature: Non-linear. Notice that for X values between -2 and 2 the curve is very steep. This
means that small changes in X bring about large changes in the value of Y.
  Value Range : 0 to 1
  Uses: Usually used in the output layer of a binary classifier, where the result is either 0 or 1.
As the value of the sigmoid function lies between 0 and 1, the result can be predicted to be 1 if
the value is greater than 0.5 and 0 otherwise.

Tanh Function
  An activation that almost always works better than the sigmoid function is the Tanh function, also
known as the hyperbolic tangent function. It is a scaled and shifted version of the sigmoid
function. Both are similar and can be derived from each other.
  Equation :- tanh(x) = 2/(1 + e^(-2x)) - 1 = 2 * sigmoid(2x) - 1

  Value Range :- -1 to +1
  Nature :- non-linear
  Uses :- Usually used in the hidden layers of a neural network. As its values lie between -1 and
1, the mean of the hidden layer outputs comes out to be 0 or very close to it, which helps in
centering the data. This makes learning for the next layer much easier.

RELU Function
  It stands for Rectified Linear Unit. It is the most widely used activation function, chiefly
implemented in the hidden layers of a neural network.
  Equation :- A(x) = max(0, x). It gives an output x if x is positive and 0 otherwise.
  Value Range :- [0, inf)
  Nature :- non-linear, which means we can easily backpropagate the errors and have
multiple layers of neurons being activated by the ReLU function.
  Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves
simpler mathematical operations. At any time only a few neurons are activated, which makes the
network sparse and therefore efficient and easy to compute.
In simple words, ReLU learns much faster than the sigmoid and Tanh functions.

Softmax Function
The softmax function is a generalization of the sigmoid function to more than two classes, and it is
handy when we are trying to handle multi-class classification problems.
 Nature :- non-linear

  Uses :- Usually used when trying to handle multiple classes. The softmax function is
commonly found in the output layer of image classification problems. It exponentiates the output
for each class and divides by the sum of the exponentiated outputs,
softmax(z_i) = e^(z_i) / Σ_j e^(z_j), so that every value lies between 0 and 1 and the values sum to 1.

  Output :- The softmax function is ideally used in the output layer of the classifier, where
we are actually trying to obtain the probabilities that define the class of each input.
  The basic rule of thumb is that if you really don't know what activation function to use, then
simply use ReLU, as it is a general activation function for hidden layers and is used in most
cases these days.
  If your output is for binary classification, then the sigmoid function is a very natural choice
for the output layer.
  If your output is for multi-class classification, then Softmax is very useful to predict the
probabilities of each class.
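
The activation functions described in this answer can be written compactly in Python. This is a minimal sketch using only the standard library; the function names are hypothetical. `tanh_from_sigmoid` uses the identity tanh(x) = 2 * sigmoid(2x) - 1, and `softmax` subtracts the maximum input before exponentiating, a common numerical-stability step that does not change the result.

```python
import math


def sigmoid(x):
    """S-shaped function with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """Tanh expressed as a scaled and shifted sigmoid, range (-1, 1)."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    """Returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)


def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))                            # 0.5, the usual decision threshold
print(tanh_from_sigmoid(0.7), math.tanh(0.7))  # the two values agree
print(relu(-3.0), relu(2.0))                   # 0.0 2.0
print(softmax([1.0, 2.0, 3.0]))                # largest score gets the largest probability
```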

5.Explain in detail about gradient descent optimization?


Gradient descent optimization
 Gradient descent is an optimization algorithm in machine learning used to minimize a function by
iteratively moving towards the minimum value of that function.

 We essentially use this algorithm when we have to find the parameter values that minimize a
given cost function. In machine learning, more often than not, we try to minimize loss
functions (like mean squared error). By minimizing the loss function, we can improve our model,
and gradient descent is one of the most popular algorithms used for this purpose.

 The accompanying graph shows how exactly the gradient descent algorithm works.

 We first take a point on the cost function and begin moving in steps in the direction of the
minimum point. The size of the step, or how quickly we converge to the minimum point, is
defined by the learning rate.

 We can cover more area with a higher learning rate, but at the risk of overshooting the minimum. On
the other hand, small steps (smaller learning rates) will consume a lot of time to reach the
lowest point.

 Now, the direction in which the algorithm has to move is also important. We calculate this by
using derivatives. You need to be familiar with derivatives from calculus. A derivative is basically
calculated as the slope of the graph at any specific point. We get that by finding the tangent line
to the graph at that point. A steeper tangent suggests that more steps are needed to reach the
minimum point; a less steep tangent suggests that fewer steps are required to reach the minimum
point.

Fig: Gradient descent optimization
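
The effect of the learning rate can be seen on a one-variable example. The cost function f(x) = (x - 3)^2, its derivative 2(x - 3), the starting point 0 and the three learning rates below are all assumptions chosen for illustration; the function name `gradient_descent` is hypothetical.

```python
def gradient_descent(learning_rate, start=0.0, steps=50):
    """Minimize f(x) = (x - 3)^2 by repeatedly stepping against the derivative."""
    x = start
    for _ in range(steps):
        grad = 2.0 * (x - 3.0)     # slope of the tangent at the current point
        x -= learning_rate * grad  # move downhill by a step scaled by the learning rate
    return x


print(gradient_descent(0.1))    # ends very close to the minimum at x = 3
print(gradient_descent(0.001))  # steps too small: still far from 3 after 50 steps
print(gradient_descent(1.1))    # steps too large: overshoots and moves away from 3
```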

Stochastic gradient descent


The word stochastic refers to a system or a process that is linked with random probability. Hence, in
stochastic gradient descent, a few samples are selected randomly instead of the whole data set for
each iteration.

 Stochastic gradient descent is a type of gradient descent that uses one training example per
iteration. Within each training epoch it processes every example in the dataset and updates
the model parameters after each example, one at a time.
 As it requires only one training example at a time, it is easier to fit in the allocated
memory. However, it loses some computational efficiency in comparison to batch
gradient descent, because the parameters are updated so frequently.

 Further, due to frequent updates, it is also treated as a noisy gradient. However, sometimes it
can be helpful in finding the global minimum and also escaping the local minimum.

Advantages of stochastic gradient descent:


It is easier to fit in the available memory.
Each update is faster to compute than in batch gradient descent.
It is more efficient for large datasets.

Disadvantages of stochastic gradient descent:


SGD requires a number of hyperparameters, such as the regularization parameter and the
number of iterations.
SGD is sensitive to feature scaling.
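
A minimal sketch of stochastic gradient descent, assuming a one-feature linear model y = w*x + b trained with squared error on noise-free toy data generated from y = 2x + 1. The function name `sgd_linear_regression`, the learning rate, the epoch count and the random seed are all hypothetical choices.

```python
import random


def sgd_linear_regression(data, learning_rate=0.01, epochs=1000, seed=0):
    """Fit y = w*x + b, updating the parameters after every single example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)                      # visit examples in random order
        for x, y in samples:
            error = (w * x + b) - y               # prediction error on one example
            w -= learning_rate * 2.0 * error * x  # gradient of error^2 w.r.t. w
            b -= learning_rate * 2.0 * error      # gradient of error^2 w.r.t. b
    return w, b


data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
w, b = sgd_linear_regression(data)
print(w, b)  # close to the true slope 2 and intercept 1
```

Only one (x, y) pair is needed in memory for each update, which is the memory advantage listed above; the cost is that the parameters are changed five times per epoch instead of once.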
