
Artificial Intelligence and Machine Learning (21CS54)

MODULE-03

BASICS OF LEARNING THEORY


Learning is a process by which one can acquire knowledge and construct new ideas or concepts based on experience.
Machine learning is an intelligent way of learning a general concept from training examples without writing an explicit program.
There are many machine learning algorithms through which computers can intelligently learn from past
data or experiences, identify patterns, and make predictions when new data is fed.

INTRODUCTION TO LEARNING AND ITS TYPES


 The process of acquiring knowledge and expertise through study, experience, or being taught is called learning. Generally, humans learn in different ways.
 To make machines learn, we need to simulate the strategies of human learning in machines. But can computers learn? What sort of tasks can computers learn?
 This depends on the nature of problems that the computers can solve.
 There are two kinds of problems - well-posed and ill-posed. Computers can solve only well-posed problems, as these have well-defined specifications and have the following components inherent to them:
1. Class of learning tasks (T)
2. A measure of performance (P)
3. A source of experience (E)
 The standard definition of learning proposed by Tom Mitchell is that a program is said to learn from experience E with respect to task T and performance measure P, if its performance at task T, as measured by P, improves with experience E.
 Let us formalize the concept of learning as follows:
 Let x be the input and X be the input space, which is the set of all inputs, and Y is the output space,
which is the set of all possible outputs, that is, yes/no.
 Let D be the input dataset with examples (x1, y1), (x2, y2), ..., (xn, yn) for n inputs.
 Let the unknown target function be f: X → Y, that maps the input space to output space
 The objective of the learning program is to pick a function g: X → Y that approximates the unknown target function f. All the possible formulae form a hypothesis space.
 Let H be the set of all formulae from which the learning algorithm chooses. The choice is good when the hypothesis g replicates f for all samples. This is shown in Figure 3.1.


 It can be observed that the training samples and the target function are dependent on the given problem. The learning algorithm and hypothesis set are independent of the given problem. Thus, a learning model is informally the hypothesis set and the learning algorithm. The learning model can be stated as follows:
Learning Model = Hypothesis Set + Learning Algorithm

 Let us assume a problem of predicting a label for a given input data. Let D be the input dataset with
both positive and negative examples. Let y be the output with class 0 or 1.
 The simple learning model can be given as:
∑ wi xi > Threshold → belongs to class 1, and
∑ wi xi < Threshold → belongs to the other class.
 This can be put into a single equation as follows:
h(x) = sign((∑ wi xi) + b), where the sum runs over i = 1 to d
where x1, x2, ..., xd are the components of the input vector, w1, w2, ..., wd are the weights, and +1 and -1 represent the two classes. This simple model is called the perceptron model. One can simplify this by introducing a constant input x0 = 1 whose weight is w0 = b, so the model can further be simplified as:
h(x) = sign(wT x).
 This is called the perceptron learning algorithm.
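The decision rule above can be written as a tiny function. The sketch below is an illustrative example (not from the notes); it assumes NumPy and folds the bias b into the weight vector as w0, paired with a constant input x0 = 1, exactly as described above.

import numpy as np

def perceptron_predict(w, x):
    """Perceptron decision rule h(x) = sign(w^T x).
    w : weight vector, with w[0] acting as the bias b (paired with x0 = 1)
    x : input feature vector (without the constant term)
    Returns +1 or -1, the predicted class."""
    x_aug = np.concatenate(([1.0], x))          # prepend the fixed input x0 = 1
    return 1 if np.dot(w, x_aug) > 0 else -1

# Hypothetical weights and inputs, purely for illustration
w = np.array([-0.5, 1.0, 1.0])                  # b = -0.5, w1 = w2 = 1
print(perceptron_predict(w, np.array([0.1, 0.2])))   # -> -1
print(perceptron_predict(w, np.array([0.6, 0.4])))   # -> +1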

Learning Types
There are different types of learning. Some of the different learning methods are as follows:
1. Learn by memorization or learn by repetition also called as rote learning is done by
memorizing without understanding the logic or concept. Although rote learning is basically
learning by repetition, in machine learning perspective, the learning occurs by simply comparing
with the existing knowledge for the same input data and producing the output if present.
2. Learn by examples also called as learn by experience or previous knowledge acquired at some
time, is like finding an analogy, which means performing inductive learning from observations


that formulate a general concept. Here, the learner learns by inferring a general rule from the set
of observations or examples. Therefore, inductive learning is also called as discovery learning.
3. Learn by being taught by an expert or a teacher, generally called as passive learning. However,
there is a special kind of learning called active learning where the learner can interactively query
a teacher/expert to label unlabelled data instances with the desired outputs.
4. Learning by critical thinking, also called as deductive learning, deduces new facts or
conclusion from related known facts and information.
5. Self-learning, also called as reinforcement learning, is a self-directed learning that normally
learns from mistakes, punishments, and rewards.
6. Learning to solve problems is a type of cognitive learning where learning happens in the mind
and is possible by devising a methodology to achieve a goal. The learning happens either
directly from the initial state by following the steps to achieve the goal or indirectly by inferring
the behaviour.
7. Learning by generalizing explanations, also called as explanation-based learning (EBL), is
another learning method that exploits domain knowledge from experts to improve the accuracy
of learned concepts by supervised learning.

INTRODUCTION TO COMPUTATIONAL LEARNING THEORY


 There are many questions that have been raised by mathematicians and logicians over the time
taken by computers to learn.
 Some of the questions are as follows:
1. How can a learning system predict an unseen instance?
2. How close is the hypothesis h to the target function f, when f itself is unknown?
3. How many samples are required?
4. Can we measure the performance of a learning system?
5. Is the solution obtained local or global?
 These questions are the basis of a field called 'Computational Learning Theory' or in short
(COLT).
 It is a specialized field of study of machine learning. COLT deals with formal methods used for
learning systems.
 It deals with frameworks for quantifying learning tasks and learning algorithms. It provides a
fundamental basis for study of machine learning.
 Computational Learning Theory uses many concepts from diverse areas such as Theoretical
Computer Science, Artificial Intelligence and Statistics.
 The core concept of COLT is the learning framework. One such important framework is PAC (Probably Approximately Correct) learning. COLT focuses on supervised learning tasks. Since the complexity of the analysis is high, normally binary classification tasks are considered for analysis.


DESIGN OF A LEARNING SYSTEM


 A system that is built around a learning algorithm is called a learning system. The design of systems
focuses on these steps:
1. Choosing a training experience
2. Choosing a target function
3. Representation of a target function
4. Function approximation

1. Training Experience
 Let us consider designing a program that learns to play chess.
 In direct experience, individual board states and the correct moves of the chess game are given directly.
 In indirect experience, only the move sequences and the final results are given.
 The training experience also depends on the presence of a supervisor who can label all valid
moves for a board state.
 In the absence of a supervisor, the game agent plays against itself and learns the good moves.
 If the training samples and testing samples have the same distribution, the results would be
good.

2. Determine the Target Function


 The next step is the determination of a target function.
 In this step, the type of knowledge that needs to be learnt is determined.
 In direct experience, a board move is selected and is determined whether it is a good move or
not against all other moves.
 If it is the best move, then it is chosen as: B → M, where B is the set of legal board states and M is the set of legal moves.
 In indirect experience, all legal moves are accepted and a score is generated for each. The move
with largest score is then chosen and executed.
3. Determine the Target Function Representation
 The representation of knowledge may be a table, collection of rules or a neural network. The
linear combination of these factors can be coined as:

V = w0 + w1x1 + w2x2 + w3x3

where x1, x2, and x3 represent different board features and w0, w1, w2, and w3 represent the weights.

4. Choosing an Approximation Algorithm for the Target Function


 The focus is to choose weights and fit the given training samples effectively. The aim is to
reduce the error given as:


E = ∑ [Vtrain(b) - V̂(b)]², where the sum runs over the training samples b.

Here, b is a sample board state, Vtrain(b) is its training value and V̂(b) is the value predicted by the current hypothesis. The approximation is carried out as follows:
 Compute the error as the difference between the trained and expected hypothesis values. Let this error be error(b).
 Then, for every board feature xi, the weights are updated as:
wi = wi + μ * error(b) * xi
Here, μ is a small constant that moderates the size of the weight update, as sketched below.
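As a small illustration of this update rule (the values below are hypothetical, not from the chess example), one update step can be sketched in Python as:

def lms_update(weights, features, v_train, v_hat, mu=0.1):
    """One weight update: w_i <- w_i + mu * error(b) * x_i,
    where error(b) = V_train(b) - V_hat(b)."""
    error = v_train - v_hat
    return [w + mu * error * x for w, x in zip(weights, features)]

# Hypothetical sample: constant feature x0 = 1 plus three board features
weights = [0.5, 0.1, -0.2, 0.3]
features = [1, 3, 0, 2]
v_hat = sum(w * x for w, x in zip(weights, features))   # current prediction V_hat(b)
weights = lms_update(weights, features, v_train=1.0, v_hat=v_hat)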
Thus, the learning system has the following components:
i. A Performance system to allow the game to play against itself.
ii. A Critic system to generate the samples.
iii. A Generalizer system to generate a hypothesis based on samples.
iv. An Experimenter system to generate a new system based on the currently learnt function.
This is sent as input to the performance system.

INTRODUCTION TO CONCEPT LEARNING


 Concept learning is a learning strategy of acquiring abstract knowledge or inferring a general
concept or deriving a category from the given training samples.
 It is a process of abstraction and generalization from the data.
 Concept learning helps to classify an object that has a set of common, relevant features. Thus it
helps a learner compare and contrast categories based on the similarity and association of positive
and negative instances in the training data to classify an object.
 The learner tries to simplify by observing the common features from the training samples and then
apply this simplified model to the future samples. This task is also known as learning from
experience.
 Each concept or category obtained by learning is a Boolean valued function which takes a true or
false value.
 The way of learning categories for objects and recognizing new instances of those categories is called concept learning.
 Concept Learning: It is formally defined as inferring a Boolean valued function by processing
training instances.
 Concept learning requires three things:
1. Input - Training dataset which is a set of training instances, each labeled with the name of a
concept or category to which it belongs. Use this past experience to train and build the model.
2. Output - Target concept or Target function f. It is a mapping function f(x) from input x to
output y. It is to determine the specific features or common features to identify an object.
3. Test-New instances to test the learned model.


 Formally, Concept learning is defined as "Given a set of hypotheses, the learner searches through
the hypothesis space to identify the best hypothesis that matches the target concept".
 Consider the following set of training instances shown in Table 3.1.
Table: 3.1- Enjoy sport data set

Sky Air_Temp Humidity Wind Water Forecast Enjoy_Sport

sunny warm normal strong warm same yes

sunny warm high strong warm same yes

rainy cold high strong warm change no

sunny warm high strong cool change yes


 Here, in this set of training instances, the independent attributes considered are 'Sky', 'Air_Temp', 'Humidity', 'Wind', 'Water', and 'Forecast'.
 The dependent attribute is 'Enjoy_Sport'. The target concept is to identify whether the person enjoys the sport or not.
Representation of a Hypothesis
 A hypothesis 'h' approximates a target function 'f' to represent the relationship between the
independent attributes and the dependent attribute of the training instances.
 The hypothesis is the predicted approximate model that best maps the inputs to outputs.
 Each hypothesis is represented as a conjunction of attribute conditions in the antecedent part.
 For example, (Sky = Sunny) ∧ (Air Temp = Warm) ∧ ...
 The set of hypotheses in the search space is called hypotheses; hypotheses is the plural form of hypothesis. Generally, 'H' is used to represent the set of hypotheses and 'h' is used to represent a candidate hypothesis.
 Each attribute condition is the constraint on the attribute which is represented as attribute-value
pair.
 In the antecedent of an attribute condition of a hypothesis, each attribute can take value as either '?'
or 'ϕ' or can hold a single value.
 "?" denotes that the attribute can take any value [e.g.,Wind= ?]
 "ϕ" denotes that the attribute cannot take any value, i.e., it represents a null value [e.g.,
Water= ϕ]
 Single value denotes a specific single value from the acceptable values of the attribute; for example, the attribute 'Humidity' can take the value 'high'.
 For example, a hypothesis 'h' will look like,
h = <Sunny, Warm, ?, Strong, ?, ?>


 The task is to predict the best hypothesis of the target concept. The most general hypothesis allows any value for each of the attributes.
 It is represented as <?, ?, ?, ?, ?, ?>. This hypothesis indicates that on any day the person can enjoy sports.

 The most specific hypothesis will not allow any value for any attribute: < ø, ø, ø, ø, ø, ø >. This hypothesis indicates that on no day can the person enjoy sports.
 Concept learning is also called Inductive Learning, since it tries to induce a general function from specific training instances.

Hypothesis Space
 Hypothesis space is the set of all possible hypotheses that approximates the target function f. In
other words, the set of all possible approximations of the target function can be defined as
hypothesis space. From this set of hypotheses in the hypothesis space, a machine learning
algorithm would determine the best possible hypothesis that would best describe the target
function or best fit the outputs.
 Generally, a hypothesis representation language represents a larger hypothesis space.
 Every machine learning algorithm would represent the hypothesis space in a different manner
about the function that maps the input variables to output variables.
 The set of hypotheses that can be generated by a learning algorithm can be further reduced by
specifying a language bias.
 The subset of hypothesis space that is consistent with all-observed training instances is called as
Version Space. Version space represents the only hypotheses that are used for the classification.
 For example, each of the attribute given in the Table 3.1 (Enjoy sport table) has the following
possible set of values.
 Sky= Sunny, Rainy
 Air Temp = Warm, Cold
 Humidity = Normal, High
 Wind= Strong
 Water =Warm, Cool
 Forecast= Same, Change
 Considering these values for each of the attribute, there are (2*2*2*1*2*2) = 32 distinct
instances covering all the 4 instances in the training dataset.
 So, we can generate (4*4*4*3*4*4) = 3072 distinct hypotheses when including two more values [?, ϕ] for each of the attributes. However, any hypothesis containing one or more ϕ symbols represents the empty set of instances; that is, it classifies every instance as a negative instance.


Therefore, there will be 1 + (3*3*3*2*3*3) = 487 distinct hypotheses by including only '?' for each of the attributes plus one hypothesis representing the empty set of instances.
 Thus, the hypothesis space is much larger and hence we need efficient learning algorithms to
search for the best hypothesis from the set of hypotheses.
 Hypothesis ordering is also important wherein the hypotheses are ordered from the most specific
one to the most general one in order to restrict searching the hypothesis space exhaustively.
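The counts above can be checked with a few lines of arithmetic; the sketch below simply encodes the number of values per attribute listed earlier.

from math import prod

# Values per attribute: Sky, Air_Temp, Humidity, Wind, Water, Forecast
domain_sizes = [2, 2, 2, 1, 2, 2]

distinct_instances = prod(domain_sizes)                       # 2*2*2*1*2*2 = 32
syntactic_hypotheses = prod(n + 2 for n in domain_sizes)      # add '?' and 'ϕ' per attribute -> 3072
semantic_hypotheses = 1 + prod(n + 1 for n in domain_sizes)   # one empty-set hypothesis + '?' per attribute -> 487

print(distinct_instances, syntactic_hypotheses, semantic_hypotheses)   # 32 3072 487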

Heuristic Space Search


 Heuristic search is a search strategy that finds an optimized hypothesis/solution to a problem by
iteratively improving the hypothesis/solution based on a given heuristic function or a cost
measure.
 Heuristic search methods will generate a possible hypothesis that can be a solution in the
hypothesis space or a path from the initial state.
 This hypothesis will be tested with the target function or the goal state to see if it is a real
solution.
 If the tested hypothesis is a real solution, then it will be selected. This method generally increases efficiency because it is guaranteed to find a better hypothesis, though it may not be the best hypothesis.
 It is useful for solving tough problems which could not be solved by any other method.
 Several commonly used heuristic search methods are hill climbing methods, constraint
satisfaction problems, best-first search, simulated-annealing, A* algorithm, and genetic
algorithms

Generalization and Specialization


By generalization of the most specific hypothesis and by specialization of the most general hypothesis,
the hypothesis space can be searched for an approximate hypothesis that matches all positive instances
but does not match any negative instance.
Searching the Hypothesis Space
There are two ways of learning the hypothesis, consistent with all training instances from the large
hypothesis space.
1. Specialization - General to Specific learning
2. Generalization - Specific to General learning
Generalization - Specific to General Learning: This learning methodology searches through the hypothesis space for an approximate hypothesis by generalizing the most specific hypothesis.
Example: Consider the training instances shown in Table 3.2 and illustrate Specific to General
Learning.
Sl. No Horns Tail Tusks Paws Fur Color Hooves Size Elephant


1 No Short Yes No No Black No Big Yes


2 Yes Short No No No Brown Yes Medium No
3 No Short Yes No No Black No Medium Yes
4 No Long No Yes Yes White No Medium No
5 No Short Yes Yes Yes Black No Big Yes
Solution: We will start from the most specific hypothesis and determine its most restrictive generalization. Consider only the positive instances and generalize the most specific hypothesis; ignore the negative instances.
This learning is illustrated as follows:
The most specific hypothesis is taken now, which will not classify any instance to true.

h=< ø, ø, ø, ø, ø, ø, ø, ø >
Read the first instance I1, to generalize the hypothesis h so that this positive instance can be classified
by the hypothesis h1.
I1: <No, Short, Yes, No, No, Black, No, Big> →Yes [positive]
h1: <No, Short, Yes, No, No, Black, No, Big>

Second instance I2; it is a negative instance so ignore it.


I2: <Yes, Short, No, No, No, Brown, Yes, Medium> →No [negative]
h2: <No, Short, Yes, No, No, Black, No, Big>

Third instance I3: it is a positive instance, so generalize h2 to h3 to accommodate it. The resulting h3 is:
I3: <No, Short, Yes, No, No, Black, No, Medium> →Yes [positive]
h3: <No, Short, Yes, No, No, Black, No, ? >

Fourth instance I4; it is a negative instance so ignore it.


I4: <No, Long, No, Yes, Yes, White, No, Medium> →No [negative]
h4: <No, Short, Yes, No, No, Black, No, ? >

Fifth instance I5 is a positive instance, so h4 is further generalized to h5.

I5: <No, Short, Yes, Yes, Yes, Black, No, Big> →Yes [positive]
h5: <No, Short, Yes, ?, ?, Black, No, ? >

Specialization- General to specific Learning: The learning methodology will search through the
hypothesis space for an approximate hypothesis by specializing the most general hypothesis.

Example: Consider the training instances shown in Table 3.2 and illustrate General to Specific
Learning.
Sl. No Horns Tail Tusks Paws Fur Color Hooves Size Elephant
1 No Short Yes No No Black No Big Yes
2 Yes Short No No No Brown Yes Medium No
3 No Short Yes No No Black No Medium Yes


4 No Long No Yes Yes White No Medium No


5 No Short Yes Yes Yes Black No Big Yes

Solution: Start from the most general hypothesis, which classifies all instances (positive and negative) as true.

Initially,
h = <?, ?, ?, ?, ?, ?, ?, ?>
h is the most general hypothesis and classifies every instance as true.

Hypothesis Space Search by Find-S Algorithm

Find-S algorithm is guaranteed to converge to the most specific hypothesis in H that is consistent with
the positive instances in the training dataset. Obviously, it will also be consistent with the negative
instances.
Thus, this algorithm considers only the positive instances and eliminates negative instances while
generating the hypothesis. It initially starts with the most specific hypothesis.


Algorithm: Find-S
Input: Positive instances in the Training dataset
Output: Hypothesis 'h'
Step 1: Initialize 'h' to the most specific hypothesis.

h=< ø, ø, ø, ø, ø, ø, ø >
Step 2: Generalize the initial hypothesis for the first positive instance [since 'h' is the most specific].
Step 3: For each subsequent instances:
If it is a positive instance,
Check for each attribute value in the instance with the hypothesis 'h'.
If the attribute value is the same as the hypothesis value, then do nothing
Else if the attribute value is different than the hypothesis value, change it to'?' in
'h'.
Else if it is a negative instance, Ignore it.
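A minimal Python sketch of the steps above, assuming each instance is a tuple of attribute values and each label is 'yes' or 'no' (the data are the Enjoy_Sport rows used in the example that follows):

def find_s(instances, labels):
    """Return the maximally specific hypothesis consistent with the positive instances."""
    h = ['ϕ'] * len(instances[0])          # Step 1: most specific hypothesis
    for x, label in zip(instances, labels):
        if label != 'yes':                 # Step 3: negative instances are ignored
            continue
        for i, value in enumerate(x):
            if h[i] == 'ϕ':
                h[i] = value               # first positive instance copies its values
            elif h[i] != value:
                h[i] = '?'                 # conflicting value -> generalize to '?'
    return h

data = [
    ('sunny', 'warm', 'normal', 'strong', 'warm', 'same'),
    ('sunny', 'warm', 'high',   'strong', 'warm', 'same'),
    ('rainy', 'cold', 'high',   'strong', 'warm', 'change'),
    ('sunny', 'warm', 'high',   'strong', 'cool', 'change'),
]
labels = ['yes', 'yes', 'no', 'yes']
print(find_s(data, labels))   # ['sunny', 'warm', '?', 'strong', '?', '?']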

Example: Apply Find-S algorithm to the below table and find the Maximally Specific Hypothesis.

Sky Air_Temp Humidity Wind Water Forecast Enjoy_Sport

sunny warm normal strong warm same yes

sunny warm high strong warm same yes

rainy cold high strong warm change no

sunny warm high strong cool change yes

Step 1: Initialize h to the most specific hypothesis in H

h0 = (ø, ø, ø, ø, ø, ø)

Step 2: Iteration-1
Given instance X1 = <Sunny, Warm, Normal, Strong, Warm, Same> →Yes [positive]
h1 = <Sunny, Warm, Normal, Strong, Warm, Same>

Step 2: Iteration-2
Given instance X2 = <Sunny, Warm, High, Strong, Warm, Same> →Yes [positive]
h2 = <Sunny, Warm, ?, Strong, Warm, Same>

Step 2: Iteration-3
Given instance X3 = <Rainy, Cold, High, Strong, Warm, Change> →No [negative]
Since X3 is Negative this example is ignored, therefore
h3 = <Sunny, Warm, ?, Strong, Warm, Same>


Step 2: Iteration-4
Given instance X4 = <Sunny, Warm, High, Strong, Cool, Change> →Yes [positive]
h4 = <Sunny, Warm, ?, Strong, ?, ?>
Step 3:
The final maximally specific hypothesis is <Sunny, Warm, ?, Strong, ?, ?>

Example 2:

Instance Citations Size in Library Price Editions Buy


1 Some Small No Affordable Many No
2 Many Big No Expensive One Yes
3 Some Big Always Expensive Few No
4 Many Medium No Expensive Many Yes
5 Many Small No Affordable Many Yes

Step 1:
h0 = (ø, ø, ø, ø, ø)

Step 2: Iteration 1
Given instance X1 = < some, small, no, affordable, many > →No [negative]
Since X1 is Negative this example is ignored, therefore
h1 = < ø, ø, ø, ø, ø >

Step 2: Iteration 2
Given instance X2 = < many, big, no, expensive, one > →Yes [positive]
h2 = < many, big, no, expensive, one >

Step 2: Iteration 3
X3 = < some, big, always, expensive, few > →No [negative]

Since X3 is Negative this example is ignored, therefore


h3 = <many, big, no, expensive, one>

Step 2: Iteration 4
X4 = <many, medium, no, expensive, many> →Yes [positive]
h4 = <many, ?, no, expensive, ?>

Step 2: Iteration 5
X5 = <many, small, no, affordable, many> →Yes [positive]
h5 = (many, ?, no, ?, ?)

Step 3:


Final Maximally Specific Hypothesis is: h5 = (many, ?, no, ?, ?)

Example 3:
Sl. No Horns Tail Tusks Paws Fur Color Hooves Size Elephant
1 No Short Yes No No Black No Big Yes
2 Yes Short No No No Brown Yes Medium No
3 No Short Yes No No Black No Medium Yes
4 No Long No Yes Yes White No Medium No
5 No Short Yes Yes Yes Black No Big Yes

Step 1:
h0 = (ø, ø, ø, ø, ø, ø, ø, ø)

Step 2: Iteration 1
Given instance X1 = < No, Short, Yes, No, No, Black, No, Big > →Yes [positive]
h1 = < No, Short, Yes, No, No, Black, No, Big>

Step 2: Iteration 2
Given instance X2 = < Yes, Short, No, No, No, Brown, Yes, Medium> →No [negative]
Since X2 is Negative this example is ignored, therefore
h2 = < No, Short, Yes, No, No, Black, No, Big>

Step 2: Iteration 3
X3 = < No, Short, Yes, No, No, Black, No, Medium> →Yes [positive]
h3 = < No, Short, Yes, No, No, Black, No, ? >

Step 2: Iteration 4
X4 = < No, Long, No, Yes, Yes, White, No, Medium> →No [negative]
Since X4 is Negative this example is ignored, therefore
h4 = < No, Short, Yes, No, No, Black, No, ? >

Step 2: Iteration 5
X5 = <No, Short, Yes, Yes, Yes, Black, No, Big> →Yes [positive]
h5 = < No, Short, Yes, ?, ?, Black, No, ? >

Step 3:
Final Maximally Specific Hypothesis is: h5 = < No, Short, Yes, ?, ?, Black, No, ? >

Limitations of Find- S algorithm


1. Find-S algorithm tries to find a hypothesis that is consistent with positive instances, ignoring all
negative instances. As long as the training dataset is consistent, the hypothesis found by this algorithm
may be consistent.
2. The algorithm finds only one unique hypothesis, wherein there may be many other hypotheses that
are consistent with the training dataset.


3. Many times, the training dataset may contain some errors; hence such inconsistent data instances can
mislead this algorithm in determining the consistent hypothesis since it ignores negative instances.

Hence, it is necessary to find the set of hypotheses that are consistent with the training data including the
negative examples. To overcome the limitations of Find-S algorithm, Candidate Elimination algorithm
was proposed to output the set of all hypotheses consistent with the training dataset.

Version Spaces
The version space contains the subset of hypotheses from the hypothesis space that is consistent with all
training instances in the training dataset.
List-Then-Eliminate Algorithm
 The principal idea of this learning algorithm is to initialize the version space to contain all hypotheses and then eliminate any hypothesis that is found inconsistent with any training instance.
 Initially, the algorithm starts with a version space to contain all hypotheses scanning each training
instance.
 The hypotheses that are inconsistent with the training instance are eliminated.
 Finally, the algorithm outputs the list of remaining hypotheses that are all consistent.

Algorithm: List-Then-Eliminate
Input: Version Space - a list of all hypotheses
Output: Set of consistent hypotheses
Step 1: Initialize the version space with a list of hypotheses.
Step 2: For each training instance,
remove from version space any hypothesis that is inconsistent.

 This algorithm works fine if the hypothesis space is finite but practically it is difficult to deploy this
algorithm. Hence, a variation of this idea is introduced in the Candidate Elimination algorithm.
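A minimal sketch of List-Then-Eliminate for conjunctive hypotheses over small finite attribute domains; it enumerates every hypothesis (each attribute takes a specific value, '?' or 'ϕ') and keeps only those consistent with all training instances. The two-attribute dataset is hypothetical and only illustrates the idea.

from itertools import product

def covers(h, x):
    """A conjunctive hypothesis covers x iff every condition is '?' or matches x."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def list_then_eliminate(domains, instances, labels):
    # Step 1: version space = all hypotheses
    version_space = list(product(*[list(d) + ['?', 'ϕ'] for d in domains]))
    # Step 2: remove any hypothesis inconsistent with some training instance
    for x, label in zip(instances, labels):
        version_space = [h for h in version_space
                         if covers(h, x) == (label == 'yes')]
    return version_space

domains = [('sunny', 'rainy'), ('warm', 'cold')]
instances = [('sunny', 'warm'), ('rainy', 'cold')]
labels = ['yes', 'no']
for h in list_then_eliminate(domains, instances, labels):
    print(h)   # ('sunny', 'warm'), ('sunny', '?'), ('?', 'warm')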

Version Spaces and the Candidate Elimination Algorithm


 Version space learning is to generate all consistent hypotheses around. This algorithm computes the
version space by the combination of the two cases namely,
 Specific to General learning - Generalize S to include the positive example
 General to Specific learning - Specialize G to exclude the negative example
 Using the Candidate Elimination algorithm, we can compute the version space containing all (and
only those) hypotheses from H that are consistent with the given observed sequence of training
instances.


 The algorithm defines two boundaries called general boundary which is a set of all hypotheses that
are the most general and specific boundary which is a set of all hypotheses that are the most
specific.
 Thus, the algorithm limits the version space representation to the most general and the most specific boundary hypotheses. In this way, it provides a compact representation compared to the List-Then-Eliminate algorithm.

Algorithm: Candidate Elimination


Input: Set of instances in the Training dataset
Output: Hypothesis G and S
Step 1: Initialize G, to the maximally general hypotheses.
Step 2: Initialize S, to the maximally specific hypotheses.
 Generalize the initial hypothesis for the first positive instance.
Step 3: For each subsequent new training instance,
 If the instance is positive,
o Generalize S to include the positive instance,
 Check the attribute value of the positive instance and S,
 If the attribute value of positive instance and S are different, fill that field
value with '?".
 If the attribute value of positive instance and S are same, then do no
change.
o Prune G to exclude all inconsistent hypotheses in G with the positive instance.
 If the instance is negative,
o Specialize G to exclude the negative instance,
 Add to G all minimal specializations that exclude the negative example and are consistent with S.
 If the attribute value of S and the negative instance are different, then
fill that attribute value with S value.
 If the attribute value of S and negative instance are same, no need to
update 'G' and fill that attribute value with '?'.
o Remove from S all inconsistent hypotheses with the negative instance.

Generating Positive Hypothesis 'S' : If it is a positive example, refine S to include the positive
instance. We need to generalize S to include the positive instance. The hypothesis is the conjunction of
'S' and positive instance. When generalizing, for the first positive instance, add to S all minimal
generalizations such that S is filled with attribute values of the positive instance. For the subsequent
positive instances scanned, check the attribute value of the positive instance and S obtained in the


previous iteration. If the attribute values of the positive instance and S are different, fill that field value with a '?'. If the attribute values of the positive instance and S are the same, no change is required.
If it is a negative instance, it is skipped.

Generating Negative Hypothesis 'G': If it is a negative instance, refine G to exclude the negative
instance. Then, prune G to exclude all inconsistent hypotheses in G with the positive instance. The idea
is to add to G all minimal specializations to exclude the negative instance and be consistent with the
positive instance. The negative hypothesis corresponds to the general hypothesis. If the attribute values of the positive and negative instances are different, then fill that field with the positive instance value so that the hypothesis does not classify that negative instance as true. If the attribute values of the positive and negative instances are the same, then there is no need to update 'G'; fill that attribute value with a '?'.

Generating Version Space - [Consistent Hypotheses]: We need to take the combinations of the hypotheses in 'G' and check them against 'S'. Only when the fields of a combined hypothesis match the fields in 'S' is it included in the version space as a consistent hypothesis.

EXAMPLES

Example Sky AirTemp Humidity Wind Water Forecast EnjoySport

1 Sunny Warm Normal Strong Warm Same Yes

2 Sunny Warm High Strong Warm Same Yes

3 Rain Cold High Strong Warm Change No

4 Sunny Warm High Strong Cool Change Yes

Solution:

S0: (ø, ø, ø, ø, ø, ø) Most Specific Boundary


G0: (?, ?, ?, ?, ?, ?) Most Generic Boundary
The first example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the hypothesis at the generic boundary is consistent hence we retain it.
S1: (Sunny,Warm, Normal, Strong, Warm, Same)
G1: (?, ?, ?, ?, ?, ?)
The second example is positive, again the hypothesis at the specific boundary is inconsistent, hence we
extend the specific boundary, and the hypothesis at the generic boundary is consistent hence we retain it.
S2: (Sunny,Warm, ?, Strong, Warm, Same)
G2: (?, ?, ?, ?, ?, ?)


The third example is negative, the hypothesis at the specific boundary is consistent, hence we retain it,
and hypothesis at the generic boundary is inconsistent hence we write all consistent hypotheses by
removing one “?” (question mark) at time.
S3: (Sunny,Warm, ?, Strong, Warm, Same)
G3: (Sunny,?,?,?,?,?) (?,Warm,?,?,?,?) (?,?,?,?,?,Same)
The fourth example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary are retained.
S4: (Sunny, Warm, ?, Strong, ?, ?)
G4: (Sunny,?,?,?,?,?) (?,Warm,?,?,?,?)
Learned Version Space by Candidate Elimination Algorithm for the given data set (all hypotheses lying between S4 and G4) is:
(Sunny, Warm, ?, Strong, ?, ?)
(Sunny, Warm, ?, ?, ?, ?) (Sunny, ?, ?, Strong, ?, ?) (?, Warm, ?, Strong, ?, ?)
(Sunny, ?, ?, ?, ?, ?) (?, Warm, ?, ?, ?, ?)

Example 2:
Example Size Color Shape Class/Label

1 Big Red Circle No

2 Small Red Triangle No

3 Small Red Circle Yes

4 Big Blue Circle No

5 Small Blue Circle Yes


Solution:
S0: (0, 0, 0) Most Specific Boundary
G0: (?, ?, ?) Most Generic Boundary
The first example is negative, the hypothesis at the specific boundary is consistent, hence we retain it,
and the hypothesis at the generic boundary is inconsistent hence we write all consistent hypotheses by
removing one “?” at a time.
S1: (0, 0, 0)
G1: (Small, ?, ?), (?, Blue, ?), (?, ?, Triangle)
The second example is negative, the hypothesis at the specific boundary is consistent, hence we retain it,
and the hypothesis at the generic boundary is inconsistent hence we write all consistent hypotheses by
removing one “?” at a time.
S2: (0, 0, 0)


G2: (Small, Blue, ?), (Small, ?, Circle), (?, Blue, ?), (Big, ?, Triangle), (?, Blue, Triangle)
The third example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
S3: (Small, Red, Circle)
G3: (Small, ?, Circle)
The fourth example is negative, the hypothesis at the specific boundary is consistent, hence we retain it,
and the hypothesis at the generic boundary is inconsistent hence we write all consistent hypotheses by
removing one “?” at a time.
S4: (Small, Red, Circle)
G4: (Small, ?, Circle)
The fifth example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
S5: (Small, ?, Circle)
G5: (Small, ?, Circle)
Learned Version Space by Candidate Elimination Algorithm for given data set is:
S5 = G5 = (Small, ?, Circle), so the version space contains the single hypothesis (Small, ?, Circle).

Example 3:

Example Citations Size InLibrary Price Editions Buy

1 Some Small No Affordable One No

2 Many Big No Expensive Many Yes

3 Many Medium No Expensive Few Yes

4 Many Small No Affordable Many Yes

Solution:
S0: (0, 0, 0, 0, 0) Most Specific Boundary
G0: (?, ?, ?, ?, ?) Most Generic Boundary
The first example is negative, the hypothesis at the specific boundary is consistent, hence we retain it,
and the hypothesis at the generic boundary is inconsistent hence we write all consistent hypotheses by
removing one “?” at a time.
S1: (0, 0, 0, 0, 0)
G1: (Many,?,?,?,?) (?,Big,?,?,?) (?,Medium,?,?,?) (?,?,?,Exp,?) (?,?,?,?,Many) (?,?,?,?,Few)


The second example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
S2: (Many, Big, No, Exp, Many)
G2: (Many,?,?,?, ?) (?, Big,?,?,?) (?,?,?,Exp,?) (?,?,?,?,Many)
The third example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
S3: (Many, ?, No, Exp, ?)
G3: (Many,?,?,?,?) (?,?,?,exp,?)
The fourth example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
S4: (Many, ?, No, ?, ?)
G4: (Many,?,?,?,?)
Learned Version Space by Candidate Elimination Algorithm for given data set is:
(Many, ?, No, ?, ?) (Many, ?, ?, ?, ?)
Example 4:

Example Shape Size Color Surface Thickness Target Concept

1 Circular Large Light Smooth Thick Malignant (+)

2 Circular Large Light Irregular Thick Malignant (+)


3 Oval Large Dark Smooth Thin Benign (-)

4 Oval Large Light Irregular Thick Malignant (+)

5 Circular Small Light Smooth Thick Benign (-)

Solution:
S0: (ø, ø, ø, ø, ø) Most Specific Boundary
G0: (?, ?, ?, ?, ?) Most Generic Boundary
The first example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary
S1: (Circular, Large, Light, Smooth, Thick)
G1: (?, ?, ?, ?, ?)


The second example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
G2: (?, ?, ?, ?, ?)
S2: (Circular, Large, Light, ?, Thick)
The third example is negative, the hypothesis at the specific boundary is consistent, hence we retain it,
and the hypothesis at the generic boundary is inconsistent hence we write all consistent hypotheses by
removing one “?” at a time.
G3: (Circular, ?, ?, ?, ?) (?, ?, Light, ?, ?) (?, ?, ?, ?, Thick)
S3: (Circular, Large, Light, ?, Thick)
The fourth example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary
S4: (?, Large, Light, ?, Thick)
G4: (?, ?, Light, ?, ?) (?, ?, ?, ?, Thick)
The fifth example is negative. The hypothesis at the specific boundary does not cover it and is therefore consistent, hence we retain it. The hypotheses at the generic boundary cover this negative instance and are therefore inconsistent, hence each of them is minimally specialized (using a value from S) so that the negative instance is excluded.
S5: (?, Large, Light, ?, Thick)
G5: (?, Large, Light, ?, ?) (?, Large, ?, ?, Thick)
Learned Version Space by Candidate Elimination Algorithm for given data set is:
S5: (?, Large, Light, ?, Thick)
G5: (?, Large, Light, ?, ?) (?, Large, ?, ?, Thick)


CHAPTER-04: SIMILARITY-BASED LEARNING

INTRODUCTION TO SIMILARITY OR INSTANCE-BASED LEARNING


 Similarity-based classifiers use similarity measures to locate the nearest neighbors and classify a
test instance which works in contrast with other learning mechanisms such as decision trees or
neural networks.
 Similarity-based learning is also called as Instance-based learning/Just-in time learning since it does
not build an abstract model of the training instances and performs lazy learning when classifying a
new instance.
 This learning mechanism simply stores all the data and uses it only when it needs to classify an unseen instance.
 The advantage of using this learning is that processing occurs only when a request to classify a new
instance is given.
 This methodology is particularly useful when the whole dataset is not available in the beginning but
collected in an incremental manner.
 The drawback of this learning is that it requires a large memory to store the data since a global
abstract model is not constructed initially with the training data.
 Classification of instances is done based on the measure of similarity in the form of distance
functions over data instances.
 Several distance metrics are used to estimate the similarity or dissimilarity between instances
required for clustering, nearest neighbor classification, anomaly detection, and so on.
 Popular distance metrics used are Hamming distance, Euclidean distance, Manhattan distance,
Minkowski distance, Cosine similarity, Mahalanobis distance, Pearson's correlation or correlation
similarity, Mean squared difference, Jaccard coefficient, Tanimoto coefficient, etc.
 Generally, Similarity-based classification problems formulate the features of test instance and
training instances in Euclidean space to learn the similarity or dissimilarity between instances.

Differences between Instance- and Model-based Learning


 Instance-based Learning also comes under the category of memory-based models which normally
compare the given test instance with the trained instances that are stored in memory. Memory-based
models classify a test instance by checking the similarity with the training instance.
 Some examples of Instance-based learning algorithms are:
1. k-Nearest Neighbor (k-NN)
2. Variants of Nearest Neighbor learning
3. Locally Weighted Regression
4. Learning Vector Quantization (LVQ)
5. Self-Organizing Map (SOM)


6. Radial Basis Function (RBF) networks


 These instance-based methods have serious limitations about the range of feature values taken.
Moreover, they are sensitive to irrelevant and correlated features leading to misclassification of
instances.
Instance-based Learning (Lazy Learners):
 Processing of training instances is done only during the testing phase.
 No model is built with the training instances before a test instance is received.
 Predicts the class of the test instance directly from the training data.
 Slow in the testing phase.
 Learns by making many local approximations.

Model-based Learning (Eager Learners):
 Processing of training instances is done during the training phase.
 Generalizes a model from the training instances before a test instance is received.
 Predicts the class of the test instance from the model built.
 Fast in the testing phase.
 Learns by creating global approximations.

NEAREST-NEIGHBOR LEARNING
 A natural approach to similarity-based classification is k-Nearest-Neighbors (k-NN), which is a
non-parametric method used for both classification and regression problems.
 It is a simple and powerful non-parametric algorithm that predicts the category of the test instance according to the 'k' training samples which are closest to the test instance, and classifies it into the category that has the largest probability.
 A visual representation of this learning is shown in Figure 4.1.


 There are two classes of objects, C1 and C2, in the given figure. When given a test instance T, the category of this test instance is determined by looking at the classes of its k = 3 nearest neighbors.
 Thus, the class of the test instance T is predicted as the class of the majority of those three neighbors.


K- Nearest Neighbors Algorithm


 k-NN performs instance-based learning: it just stores the training data instances and learns from them case by case.
 The model is also 'memory-based' as it uses the training data only when predictions need to be made.
 It is a lazy learning algorithm since no prediction model is built earlier with training instances and
classification happens only after getting the test instance.
 The algorithm classifies a new instance by determining the 'K' most similar instances (i.e k nearest
neighbors) and summarizing the output of those instances.
 If the target variable is discrete then it is a classification problem, so it selects the most common
class value among the instances by a majority vote.
 However, if the target variable is continuous then it is a regression problem, and hence the mean
output variable of the 'K' instances is the output of the test instance.
 The most popular distance measure such as Euclidean distance is used in k-NN to determine the 'k'
instances which are similar to the test instance.
 The value of 'k' is best determined by tuning with different 'k' values and choosing the 'k' which classifies the test instance most accurately.
 Data normalization/standardization is required when data (features) have different ranges or a
wider range of possible values when computing distances and to transform all features to a
specific range.
 k-NN classifier performance is strictly affected by three factors such as the number of nearest
neighbors (i.e., selection of k), distance metric and decision rule.
 If the k value selected is small then it may result in overfitting or less stable and if it is big then it
may include many irrelevant points from other classes.
 The choice of the distance metric selected also plays a major role and it depends on the type of
the independent attributes in the training dataset
 The k-NN classification algorithm best suits lower-dimensional data, as in a high-dimensional space the nearest neighbors may not be very close at all.

Algorithm: k-NN

Inputs: Training dataset T, distance metric d, Test instance t, the number of nearest neighbors k
Output: Predicted class or category
Prediction: For test instance t,
Step 1: For each instance i in T, compute the distance between the test instance t and every other instance i in the training dataset using a distance metric (Euclidean distance).
[Continuous attributes - Euclidean distance between two points in the plane with coordinates (x1, y1) and (x2, y2) is given as dist((x1, y1), (x2, y2)) = √((x1 - x2)² + (y1 - y2)²)]


[Categorical attributes (Binary) - Hamming Distance: If the value of the two instances is same, the
distance d will be equal to 0 otherwise d = 1.]

Step 2: Sort the distances in an ascending order and select the first k nearest training data instances to
the test instance.

Step 3: Predict the class of the test instance by majority voting (if target attribute is discrete valued)
or mean (if target attribute is continuous valued) of the k selected nearest instances.
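A minimal sketch of the three steps above for numeric features and a discrete target; in practice a library implementation such as scikit-learn's KNeighborsClassifier would normally be used.

import math
from collections import Counter

def knn_predict(train_X, train_y, test_x, k=3):
    """Predict the class of test_x by majority vote among its k nearest neighbours."""
    # Step 1: Euclidean distance from the test instance to every training instance
    dists = [(math.dist(x, test_x), y) for x, y in zip(train_X, train_y)]
    # Step 2: sort in ascending order and keep the first k
    nearest = sorted(dists, key=lambda d: d[0])[:k]
    # Step 3: majority vote among the k selected neighbours
    return Counter(y for _, y in nearest).most_common(1)[0][0]

# Brightness/Saturation data from the example that follows
train_X = [(40, 20), (50, 50), (60, 90), (10, 25), (70, 70), (60, 10), (25, 80)]
train_y = ['Red', 'Blue', 'Blue', 'Red', 'Blue', 'Red', 'Blue']
print(knn_predict(train_X, train_y, (20, 35), k=3))   # -> 'Red'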

Example:

Brightness Saturation Class


40 20 Red
50 50 Blue
60 90 Blue
10 25 Red
70 70 Blue
60 10 Red
25 80 Blue
20 35 ?
To know the class of the new instance, we have to calculate the distance from the new instance to the other entries in the dataset using the Euclidean distance formula:
√((x1 - x2)² + (y1 - y2)²)

Step 1:

Brightness Saturation Class Euclidean distance

40 20 Red √((40 - 20)² + (20 - 35)²) = 25

50 50 Blue √((50 - 20)² + (50 - 35)²) = 33.54

60 90 Blue √((60 - 20)² + (90 - 35)²) = 68.01

10 25 Red √((10 - 20)² + (25 - 35)²) = 14.14

70 70 Blue √((70 - 20)² + (70 - 35)²) = 61.03

60 10 Red √((60 - 20)² + (10 - 35)²) = 47.17

25 80 Blue √((25 - 20)² + (80 - 35)²) = 45
20 35 ?

Step 2: Assuming k = 3, select the 3 instances with the least Euclidean distance.


Brightness Saturation Class Euclidean distance

10 25 Red 14.14
40 20 Red 25
50 50 Blue 33.54

Step 3: Since Red has the majority vote, the given new test instance (20, 35) is classified as RED.

Example 2:
Sl. No CGPA Assessment Project Submitted Result
1 9.2 85 8 Pass
2 8 80 7 Pass
3 8.5 81 8 Pass
4 6 45 5 Fail
5 6.5 50 4 Fail
6 8.2 72 7 Pass
7 5.8 38 5 Fail
8 8.9 91 9 Pass

Consider the new test instance as (6.1, 40, 5)

Step 1:
Sl. No CGPA Assessment Project Submitted Result Euclidean distance

1 9.2 85 8 Pass √((9.2 - 6.1)² + (85 - 40)² + (8 - 5)²) = 45.2063
2 8 80 7 Pass √((8 - 6.1)² + (80 - 40)² + (7 - 5)²) = 40.095
3 8.5 81 8 Pass √((8.5 - 6.1)² + (81 - 40)² + (8 - 5)²) = 41.179
4 6 45 5 Fail √((6 - 6.1)² + (45 - 40)² + (5 - 5)²) = 5.001
5 6.5 50 4 Fail √((6.5 - 6.1)² + (50 - 40)² + (4 - 5)²) = 10.057
6 8.2 72 7 Pass √((8.2 - 6.1)² + (72 - 40)² + (7 - 5)²) = 32.131
7 5.8 38 5 Fail √((5.8 - 6.1)² + (38 - 40)² + (5 - 5)²) = 2.022
8 8.9 91 9 Pass √((8.9 - 6.1)² + (91 - 40)² + (9 - 5)²) = 51.233


Step 2: Assuming k = 3, select the 3 instances with the least Euclidean distance.

CGPA Assessment Project Submitted Result Euclidean distance

6 45 5 Fail 5.001

6.5 50 4 Fail 10.057

5.8 38 5 Fail 2.022

Step 3: Since Fail has the majority vote, the given new test instance (6.1, 40, 5) is classified as Fail.

WEIGHTED K-NEAREST-NEIGHBOR ALGORITHM


 The Weighted k-NN is an extension of k-NN. It chooses the neighbors by using the weighted
distance. The k-Nearest Neighbor (k-NN) algorithm has some serious limitations as its performance
is solely dependent on choosing the k nearest neighbors, the distance metric used and the decision
rule.
 However, the principal idea of Weighted k-NN is that the k closest neighbors to the test instance are assigned a higher weight in the decision as compared to neighbors that are farther away from the test instance.
 The idea is that weights are inversely proportional to distances. The selected k nearest neighbors
can be assigned uniform weights, which means all the instances in each neighborhood are weighted
equally or weights can be assigned by the inverse of their distance.
 In the second case, closer neighbors of a query point will have a greater influence than neighbors
which are further away.

Algorithm: Weighted k-NN

Inputs: Training dataset 'T', Distance metric 'd', Weighting function w(i), Test instance 't', the number of
nearest neighbors „k‟.
Output: Predicted class or category
Prediction: For test instance t,
Step 1: For each instance 'i‟ in Training dataset T, compute the distance between the test instance t and
every other instance 'i' using a distance metric (Euclidean distance).
Step 2: Sort the distances in the ascending order and select the first 'k nearest training data instances to
the test instance.
Step 3: Predict the class of the test instance by weighted voting technique (Weighting function w(i)) for
the k selected nearest instances:
 Compute the inverse of each distance of the 'K' selected nearest instances.


 Find the sum of the inverses.


 Compute the weight by dividing each inverse distance by the sum. (Each weight is a vote for
its associated class).
 Add the weights of the same class.
 Predict the class by choosing the class with the maximum vote.
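A minimal sketch of the weighted-voting steps above; the weights are inverse distances normalized by their sum, and a small epsilon guards against division by zero when a neighbour coincides with the test point.

import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, test_x, k=3, eps=1e-9):
    """Predict the class of test_x using inverse-distance weighted voting."""
    dists = sorted(((math.dist(x, test_x), y) for x, y in zip(train_X, train_y)),
                   key=lambda d: d[0])[:k]            # the k nearest neighbours
    inv = [1.0 / (d + eps) for d, _ in dists]         # inverse of each distance
    total = sum(inv)                                  # sum of the inverses
    votes = defaultdict(float)
    for w, (_, y) in zip(inv, dists):
        votes[y] += w / total                         # each weight is a vote for its class
    return max(votes, key=votes.get)                  # class with the maximum vote

# CGPA/Assessment/Project data from the example that follows, test instance (7.6, 60, 8)
train_X = [(9.2, 85, 8), (8, 80, 7), (8.5, 81, 8), (6, 45, 5),
           (6.5, 50, 4), (8.2, 72, 7), (5.8, 38, 5), (8.9, 91, 9)]
train_y = ['Pass', 'Pass', 'Pass', 'Fail', 'Fail', 'Pass', 'Fail', 'Pass']
print(weighted_knn_predict(train_X, train_y, (7.6, 60, 8), k=3))   # -> 'Fail'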

Example:
Consider the same training dataset given below Table. Use Weighted k-NN and determine the class of
the new test data (7.6, 60, 8).

Sl. No CGPA Assessment Project Submitted Result


1 9.2 85 8 Pass
2 8 80 7 Pass
3 8.5 81 8 Pass
4 6 45 5 Fail
5 6.5 50 4 Fail
6 8.2 72 7 Pass
7 5.8 38 5 Fail
8 8.9 91 9 Pass

Step 1:

Sl. No CGPA Assessment Project Submitted Result Euclidean distance

1 9.2 85 8 Pass √((9.2 - 7.6)² + (85 - 60)² + (8 - 8)²) = 25.051
2 8 80 7 Pass √((8 - 7.6)² + (80 - 60)² + (7 - 8)²) = 20.028
3 8.5 81 8 Pass √((8.5 - 7.6)² + (81 - 60)² + (8 - 8)²) = 21.019
4 6 45 5 Fail √((6 - 7.6)² + (45 - 60)² + (5 - 8)²) = 15.380
5 6.5 50 4 Fail √((6.5 - 7.6)² + (50 - 60)² + (4 - 8)²) = 10.826
6 8.2 72 7 Pass √((8.2 - 7.6)² + (72 - 60)² + (7 - 8)²) = 12.056
7 5.8 38 5 Fail √((5.8 - 7.6)² + (38 - 60)² + (5 - 8)²) = 22.276
8 8.9 91 9 Pass √((8.9 - 7.6)² + (91 - 60)² + (9 - 8)²) = 31.043


Step 2: Sort the distances in ascending order and select the first 3 nearest training data instances to the test instance. The selected nearest neighbors are:

Sl. No CGPA Assessment Project Submitted Result Euclidean distance

5 6.5 50 4 Fail 10.826
6 8.2 72 7 Pass 12.056
4 6 45 5 Fail 15.380

Step 3: Predict the class of the test instance by the weighted voting technique from the 3 selected nearest instances.
 Compute the inverse of each distance of the 3 selected nearest instances:

CGPA Assessment Project Submitted Result Euclidean distance (ED) Inverse distance (1/ED)

6.5 50 4 Fail 10.826 1/10.826 = 0.0923
8.2 72 7 Pass 12.056 1/12.056 = 0.0829
6 45 5 Fail 15.380 1/15.380 = 0.0650

 Find the sum of the inverse


Sum= 0.0923 + 0.0829 + 0.0650 = 0.2403

 Compute the weight by dividing each inverse distance by the sum:

CGPA Assessment Project Submitted Result Euclidean distance (ED) Inverse distance (1/ED) Weight = Inverse distance / Sum

6.5 50 4 Fail 10.826 0.0923 0.0923 / 0.2403 = 0.3843
8.2 72 7 Pass 12.056 0.0829 0.0829 / 0.2403 = 0.3451
6 45 5 Fail 15.380 0.0650 0.0650 / 0.2403 = 0.2705

 Add the weights of the same class


Fail = 0.2705 + 0.3843 = 0.6548
Pass = 0.3451
 Predict the class by choosing the class with the maximum vote.
Since Fail has the larger total weight, the new test instance (7.6, 60, 8) is classified as Fail.


NEAREST CENTROID CLASSIFIER


 A simple alternative to k-NN classifiers for similarity-based classification is the Nearest Centroid
Classifier.
 It is a simple classifier and also called as Mean Difference classifier.
 The idea of this classifier is to classify a test instance to the class whose centroid/mean is closest to
that instance.
Algorithm: Nearest Centroid Classifier
Inputs: Training dataset T, Distance metric d. Test instance t
Output: Predicted class or category
Step 1: Compute the mean/centroid of each class.
Step 2: Compute the distance between the test instance and mean/centroid of each class (Euclidean
Distance).
Step 3: Predict the class by choosing the class with the smaller distance.
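A minimal sketch of the three steps above: class centroids are per-feature means and the predicted class is the one whose centroid has the smallest Euclidean distance to the test instance; the data are taken from the worked example that follows.

import math
from collections import defaultdict

def nearest_centroid_predict(train_X, train_y, test_x):
    """Classify test_x to the class whose centroid (per-feature mean) is closest."""
    grouped = defaultdict(list)
    for x, y in zip(train_X, train_y):            # Step 1: group instances by class
        grouped[y].append(x)
    centroids = {c: tuple(sum(col) / len(pts) for col in zip(*pts))
                 for c, pts in grouped.items()}   # per-feature means
    # Steps 2 and 3: distance to each centroid, pick the smallest
    return min(centroids, key=lambda c: math.dist(centroids[c], test_x))

train_X = [(3, 1), (5, 2), (4, 3), (7, 6), (6, 7), (8, 5)]
train_y = ['A', 'A', 'A', 'B', 'B', 'B']
print(nearest_centroid_predict(train_X, train_y, (6, 5)))   # -> 'B'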

Example: Consider the sample data shown in below Table with two features x and y. The target classes
are 'A' or 'B'. Predict the class of the instance (6, 5) using Nearest Centroid Classifier.
X Y Class
3 1 A
5 2 A
4 3 A
7 6 B
6 7 B
8 5 B

Step 1: Compute the mean/centroid of each class. In this example there are 2 classes, 'A' and 'B'.
Centroid of class 'A' = (3 + 5 + 4, 1 + 2 + 3) / 3 = (12, 6) / 3 = (4, 2)
Centroid of class 'B' = (7 + 6 + 8, 6 + 7 + 5) / 3 = (21, 18) / 3 = (7, 6)

Step 2: Calculate the Euclidean distance between the test instance (6, 5) and each of the centroids.
Distance to centroid of class 'A' = √((6 − 4)² + (5 − 2)²) = √13 = 3.6
Distance to centroid of class 'B' = √((6 − 7)² + (5 − 6)²) = √2 = 1.414

Step 3: Since the distance to the centroid of class B is smaller than the distance to the centroid of class A, the test instance (6, 5) is assigned to class B.
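A minimal sketch of the Nearest Centroid Classifier for this example, assuming NumPy is available (function-free, step-by-step to mirror the algorithm above):

```python
import numpy as np

# Data from the example: features (x, y) and class labels
X = np.array([[3, 1], [5, 2], [4, 3], [7, 6], [6, 7], [8, 5]], dtype=float)
y = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
test = np.array([6, 5], dtype=float)

# Step 1: centroid (mean) of each class
centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}   # A: (4, 2), B: (7, 6)

# Step 2: Euclidean distance from the test instance to each centroid
dists = {c: np.linalg.norm(test - m) for c, m in centroids.items()}

# Step 3: predict the class with the smallest distance
print(dists)                        # {'A': 3.606, 'B': 1.414}
print(min(dists, key=dists.get))    # 'B'
```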

Example 2:


LOCALLY WEIGHTED REGRESSION (LWR)


 Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm that performs local regression by combining the regression model with the nearest-neighbors model.
 LWR is also referred to as a memory-based method, as it keeps the training data at prediction time but uses only the training instances that lie locally around the point of interest.
 Using the nearest neighbors algorithm, we find the instances that are closest to a test instance and fit a linear function to each of those 'k' nearest instances in the local regression model.
 The key idea is to approximate the linear functions of all 'k' neighbors so as to minimize the error, such that the overall prediction is no longer a single straight line but rather a curve.
 Ordinary linear regression finds out a linear relationship between the input x and the output y.
Given training dataset T,
 The hypothesis function h(x) gives the predicted target output as a linear function, where β0 is the intercept and β1 is the coefficient of x:
h(x) = β0 + β1·x
 The cost function minimizes the error difference between the predicted value h(xi) and the true value yi, and it is given as:

J(β) = ∑ (h(xi) − yi)²

where the sum runs over i = 1, ..., m and 'm' is the number of instances in the training dataset.
 Now the cost function is modified for locally weighted linear regression including the weights only
for the nearest neighbor points.
 Hence, the cost function is given as:

J(β) = ∑ wi (h(xi) − yi)²

where wi is the weight associated with each xi.
 The weight function used is a Gaussian kernel that gives a higher value for instances that are close
to the test instance, and for instances far away, it tends to zero but never equals to zero.
 wi is computed as:

wi = exp( −(xi − x)² / (2τ²) )

where τ is called the bandwidth parameter and controls the rate at which wi falls towards zero as xi moves away from the query/test instance x.
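A minimal sketch of locally weighted regression for a single input feature is given below. Assumptions: NumPy is available, the weighted fit is solved through the weighted normal equations, and the function name lwr_predict and the bandwidth value are illustrative rather than from the text.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query with locally weighted linear regression.
    X is a 1-D array of inputs, y the targets, tau the bandwidth."""
    # Design matrix with a bias column so that beta = [beta0, beta1]
    A = np.c_[np.ones_like(X), X]
    # Gaussian kernel weights centred on the query point
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: beta = (A^T W A)^-1 A^T W y
    beta = np.linalg.pinv(A.T @ W @ A) @ A.T @ W @ y
    return beta[0] + beta[1] * x_query

# Toy non-linear data; the local fits trace out a curve rather than one line
X = np.linspace(0, 6, 50)
y = np.sin(X) + 0.1 * np.random.randn(50)
print(lwr_predict(1.5, X, y))
```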


REGRESSION ANALYSIS
Regression analysis is a supervised learning method for predicting continuous variables. The difference between classification and regression analysis is that regression methods are used to predict quantitative variables or continuous numbers, whereas classification predicts categorical variables or labels. Regression is used to model linear or non-linear relationships among the variables of the given dataset.

INTRODUCTION TO REGRESSION
 Regression analysis is the premier method of supervised learning. This is one of the most popular and
oldest supervised learning techniques.
 Given a training dataset D containing N training points (xi, yi), where i = 1, ..., N, regression analysis is used to model the relationship between one or more independent variables xi and a dependent variable yi.
 The relationship between the dependent and independent variables can be represented as a function
as follows:
Y = f(x)
 Here, the feature variable x is also known as an explanatory variable, exploratory variable, a
predictor variable, an independent variable, a covariate, or a domain point.
 y is a dependent variable. Dependent variables are also called as labels, target variables, or response
variables.
 Regression analysis determines the change in the response variable when one explanatory variable is varied while keeping all other parameters constant.
 This is used to determine the relationship that each of the explanatory variables exhibits with the response variable. Thus, regression analysis is used for prediction and forecasting.
 Regression is used to predict continuous variables or quantitative variables such as price and revenue.
Thus, the primary concern of regression analysis is to find answer to questions such as:
 What is the relationship between the variables?
 What is the strength of the relationships?
 What is the nature of the relationship such as linear or non-linear?
 What is the relevance of the attributes?
 What is the contribution of each attribute?
 There are many applications of regression analysis. Some of the applications of regressions include
predicting:
 Sales of goods or services
 Value of bonds in portfolio management
 Premium on insurance companies


 Yield of crops in agriculture


 Prices of real estate.
INTRODUCTION TO LINEARITY, CORRELATION, AND CAUSATION
The quality of the regression analysis is determined by the factors such as correlation and causation.
Regression and Correlation
 Correlation among two variables can be done effectively using a Scatter plot, which is a plot between
explanatory variables and response variables.
 It is a 2D graph showing the relationship between two variables. The x-axis of the scatter plot is
independent, or input or predictor variables and y-axis of the scatter plot is output or dependent or
predicted variables.
 The scatter plot is useful in exploring data. Some of the scatter plots are shown in Figure 5.1.
 The Pearson correlation coefficient is the most common measure for determining whether there is an association (correlation) between two variables.
 The correlation coefficient is denoted by r.
 The positive, negative, and random correlations are given in Figure 5.1.

 In positive correlation, one variable change is associated with the change in another variable.
 In negative correlation, the relationship between the variables is reciprocal while in random
correlation, no relationship exists between variables.
 While correlation is about relationships among variables, say x and y, regression is about predicting
one variable given another variable.
Regression and Causation
 Causation is about causal relationship among variables, say x and y.
 Causation means knowing whether x causes y to happen or vice versa.
 x causes y is often denoted as x implies y.
 Correlation and Regression relationships are not same as causation relationship.


 For example, the correlation between economical background and marks scored does not imply that
economic background causes high marks.
Linearity and Non-linearity Relationships
 The linearity relationship between the variables means the relationship between the dependent and
independent variables can be visualized as a straight line.
 The line of the form, y = ax + b can be fitted to the data points that indicate the relationship between
x and y.
 By linearity, it is meant that as one variable increases, the corresponding variable also increases in a
linear manner.
 A linear relationship is shown in Figure 5.2 (a). A non-linear relationship exists in functions such as
exponential function and power function and it is shown in Figures 5.2 (b) and 5.2 (c).

Here, x-axis is given by x data and y-axis is given by y data.


 Functions such as the exponential function and the power function (for example, y = a·xᵇ) describe non-linear relationships between the dependent and independent variables that cannot be fitted with a straight line.
 This is shown in Figures 5.2 (b) and (c).

TYPES OF REGRESSION METHODS


The regression methods can be classified as follows:


Linear Regression: It is a type of regression where a line is fitted upon given data for finding the linear
relationship between one independent variable and one dependent variable to describe relationships.

Multiple Regression: It is a type of regression where a line is fitted for finding the linear relationship
between two or more independent variables and one dependent variable to describe relationships among
variables.

Polynomial Regression: It is a type of non-linear regression method of describing relationships among


variables, where an nth-degree polynomial is used to model the relationship between one independent variable and one dependent variable. Polynomial multiple regression is used to model two or more independent variables and one dependent variable.

Logistic Regression: It is used for predicting categorical variables that involve one or more independent
variables and one dependent variable. This is also known as a binary classifier.

Lasso and Ridge Regression Methods: These are special variants of regression method where
regularization methods are used to limit the number and size of coefficients of the independent
variables.
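As a quick illustration of how these variants are typically invoked (a sketch assuming scikit-learn is installed; the tiny dataset below is made up for demonstration and is not from the text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3], [4], [5]], dtype=float)   # one independent variable
y = np.array([1.2, 1.8, 2.6, 3.2, 3.8])                # continuous target

LinearRegression().fit(X, y)        # simple linear regression
Ridge(alpha=1.0).fit(X, y)          # ridge (L2-regularized) regression
Lasso(alpha=0.1).fit(X, y)          # lasso (L1-regularized) regression

# Polynomial regression via a degree-2 basis expansion of the input
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
LinearRegression().fit(X_poly, y)

y_class = np.array([0, 0, 1, 1, 1])        # categorical target
LogisticRegression().fit(X, y_class)       # logistic regression as a binary classifier
```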

LIMITATIONS OF REGRESSION METHOD

1. Outliers - Outliers are abnormal data points. They can bias the outcome of the regression model, as outliers pull the regression line towards them.

2. Number of cases - The ratio of cases (samples) to independent variables should be at least 20:1, i.e., for every explanatory variable, there should be at least 20 samples. At least five samples per explanatory variable are required in extreme cases.

3. Missing data - Missing data in training data can make the model unfit for the sampled data.

4. Multicollinearity - If the explanatory variables are highly correlated with each other, the regression is vulnerable to bias. Singularity occurs when the correlation between two explanatory variables becomes a perfect 1.

INTRODUCTION TO LINEAR REGRESSION


 In the simplest form, the linear regression model can be created by fitting a line among the scattered
data points. The line is of the form
y = a0 + a1* x + e
 Here, a0 is the intercept which represents the bias and a1, represents the slope of the line. These are
called regression coefficients. e is the error in prediction.
 The assumptions of linear regression are listed as follows:
1. The observations (y) are random and are mutually independent.
2. The difference between the predicted and true values is called an error. The errors are mutually independent and follow the same distribution, such as a normal distribution with zero mean and constant variance.
3. The distribution of the error term is independent of the joint distribution of explanatory variables.
4. The unknown parameters of the regression models are constants.


 The idea of linear regression is based on the Ordinary Least Squares (OLS) approach.
 In this method, the data points are modelled using a straight line. Any arbitrarily drawn line is not an
optimal line.

 The vertical distance between each point and the line is called an error. These individual errors are
added to compute the total error of the predicted line. This is called sum of residuals.
 The squares of the individual errors can also be computed and added to give a sum of squared error.
The line with the lowest sum of squared error is called line of best fit.
 In another words, OLS is an optimization technique where the difference between the data points and
the line is optimized.
 Mathematically, the line equations for the data points (x1, y1), (x2, y2), ..., (xn, yn) are:

y1 = (a0+ a1x1) + e1,


y₂ = (a0+a1 x2) + e2,
. . . .
yn = (a0+a1 xn) + en,
 In general, the error is given as: ei = yi - (a0 + a1 xi)
 Here, the terms (e1, e2,…. en) are error associated with the data points and denote the difference
between the true value of the observation and the point on the line. This is also called as residuals.
The residuals can be positive, negative or zero.
 A regression line is the line of best fit for which the sum of the squares of residuals is minimum.
 The minimization can be done as the minimization of the individual errors by finding the parameters a0 and a1, such that:

E = ∑ ei = ∑ [yi − (a0 + a1·xi)]

Or as the minimization of the sum of the absolute values of the individual errors:

E = ∑ |ei| = ∑ |yi − (a0 + a1·xi)|

Or as the minimization of the sum of the squares of the individual errors:

E = ∑ (ei)² = ∑ [yi − (a0 + a1·xi)]²
 The sum of the squares of the individual errors is often preferred, as the individual errors do not get cancelled out and are always positive, and the sum of squares gives a large penalty even for a small increase in the error.


 Therefore, this is preferred for linear regression, and linear regression is modelled as the following minimization function:

J(a0, a1) = ∑ (ei)² = ∑ [yi − (a0 + a1·xi)]²
 Here, J(a0, a1) is the criterion function of the parameters a0 and a1 that needs to be minimized. This is done by differentiating it with respect to a0 and a1 and equating the derivatives to zero. This yields the estimates of a0 and a1 as follows:

a1 = [ avg(xi·yi) − (x̄)(ȳ) ] / [ avg(xi²) − (x̄)² ]

 And the value of a0 is given as follows:

a0 = ȳ − a1·x̄

where x̄ and ȳ are the means of xi and yi, avg(xi·yi) is the mean of the products xi·yi, and avg(xi²) is the mean of xi².

Problems
Let us consider an example where the five week‟s sales data (in Thousands) is given as shown below.
Apply linear regression technique to predict the 7th and 9th week sales.

Xi (Week) Yi (sales in Thousands)


1 1.2
2 1.8
3 2.6
4 3.2
5 3.8
Here, there are 5 samples, so i ranges from 1 to 5. The computation table is shown below.
Xi Yi Xi2 Xi * Yi
1 1.2 1 1.2
2 1.8 4 3.6
3 2.6 9 7.8
4 3.2 16 12.8
5 3.8 25 19
Sum =15 Sum =12.6 Sum =55 Sum = 44.4
Average of (xi) = x̄ = 3    Average of (yi) = ȳ = 2.52    Average of (xi²) = 11    Average of (xi·yi) = 8.88

Solution
a1 = [ avg(xi·yi) − (x̄)(ȳ) ] / [ avg(xi²) − (x̄)² ] = (8.88 − 3 × 2.52) / (11 − 3²) = 1.32 / 2 = 0.66
a0 = ȳ − a1·x̄ = 2.52 − 0.66 × 3 = 0.54
Therefore, the fitted regression line is y = 0.54 + 0.66x.


The predicted 7th week sale would be (when x= 7), y = 0.54 + 0.66 *7 = 5.16 and 9th week sales would
be (when x=9), y = 0.54 + 0.66 *9 = 6.48
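The same calculation can be reproduced with a short NumPy sketch, using the closed-form estimates for a0 and a1 derived above (variable names are illustrative):

```python
import numpy as np

# Weekly sales data from the worked example (sales in thousands)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 1.8, 2.6, 3.2, 3.8])

# Closed-form OLS estimates: a1 = (avg(xy) - avg(x)avg(y)) / (avg(x^2) - avg(x)^2)
a1 = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
a0 = y.mean() - a1 * x.mean()
print(a0, a1)                      # approximately 0.54 and 0.66

# Predictions for weeks 7 and 9
for week in (7, 9):
    print(week, a0 + a1 * week)    # approximately 5.16 and 6.48
```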

Linear Regression in Matrix Form


Matrix notation can be used for representing the values of the independent and dependent variables. The linear regression equations for all n data points can be written compactly in matrix form as follows:

    [ y1 ]   [ 1  x1 ]            [ e1 ]
    [ y2 ] = [ 1  x2 ]  [ a0 ]  + [ e2 ]
    [ .. ]   [ .. .. ]  [ a1 ]    [ .. ]
    [ yn ]   [ 1  xn ]            [ en ]

This can be written as


Y = Xa + e, where X is an n×2 matrix, Y is an n×1 vector, a is a 2×1 column vector, and e is an n×1 column vector.

Problem
Find linear regression of the data of week and product sales (in Thousands) given. Use linear regression
in matrix form.
Xi (week) Yi (product Sales in Thousands)
1 1
2 3
3 4
4 8

Solution:
Here, the independent variable X is given as
XT = [1 2 3 4]
and the dependent variable is given as follows:
YT = [1 3 4 8]
Including a column of ones for the intercept term, the data can be given in matrix form as follows:

    X = [ 1  1        Y = [ 1
          1  2              3
          1  3              4
          1  4 ]            8 ]

The regression is given as: a= ((XTX)-1 XT) Y


The computation of this equation is shown step by step as follows:

1. Computation of (XTX):

   (XTX) = [ 4   10
             10  30 ]

2. Computation of the matrix inverse (XTX)-1 (the determinant is 4 × 30 − 10 × 10 = 20):

   (XTX)-1 = [ 1.5   −0.5
               −0.5   0.2 ]

3. Computation of ((XTX)-1 XT):

   ((XTX)-1 XT) = [ 1     0.5   0      −0.5
                    −0.3  −0.1  0.1     0.3 ]

4. Finally, a = ((XTX)-1 XT) Y:

   a = [ −1.5
          2.2 ]

Therefore, a0 = −1.5 and a1 = 2.2, and the fitted regression line is y = −1.5 + 2.2x.
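The matrix computation can be verified with a minimal NumPy sketch (np.linalg.lstsq is shown only as a cross-check; it is not part of the text):

```python
import numpy as np

# Design matrix with an intercept column and the target vector from the example
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
Y = np.array([1, 3, 4, 8], dtype=float)

# Normal-equation solution: a = (X^T X)^-1 X^T Y
a = np.linalg.inv(X.T @ X) @ X.T @ Y
print(a)            # [-1.5  2.2]  ->  y = -1.5 + 2.2x

# Least-squares solver gives the same result more robustly
print(np.linalg.lstsq(X, Y, rcond=None)[0])
```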

VALIDATION OF REGRESSION METHODS


The regression model should be evaluated using some metrics for checking the correctness. The
following metrics are used to validate the results of regression.

Standard Error Estimate


Residuals (or errors) are the differences between the actual values (y) and the predicted values (ŷ). If the residuals follow a normal distribution, then their mean is zero, which is desirable. The standard deviation of the residuals is called the residual standard error. If it is zero, it means that the model fits the data perfectly.

The standard error estimate is another useful measure of regression. It is the standard deviation of the observed values about the predicted values. This is given as:

Standard error estimate = √( ∑ (yi − ŷi)² / (n − 2) )

Here, yi is the observed value, ŷi is the predicted value, and n is the number of samples.

Mean Absolute Error (MAE)


MAE is the mean of the absolute residuals. It is the average absolute difference between the estimated (predicted) target values and the actual target values. It can be mathematically defined as follows:

MAE = (1/n) ∑ |yi − ŷi|

Here, ŷi is the estimated or predicted target output, yi is the actual target output, and n is the number of samples used for regression analysis.

Mean Squared Error (MSE)


It is the mean of the squared residuals. This value is always positive, and a value closer to 0 indicates a better fit. It is given mathematically as:

MSE = (1/n) ∑ (yi − ŷi)²


Root Mean Square Error (RMSE)


The square root of the MSE is called RMSE. This is given as:

RMSE = √MSE = √( (1/n) ∑ (yi − ŷi)² )

Relative MSE
Relative MSE is the ratio of the prediction error of ŷ to the error of the trivial prediction that always outputs the average ȳ. A value of zero indicates that the model is perfect, and for a good model its value lies between 0 and 1. If the value is more than 1, then the created model is not a good one.

RelMSE = ∑ (yi − ŷi)² / ∑ (yi − ȳ)²

Coefficient of Variation
The coefficient of variation is unitless and is given as the ratio of the RMSE to the mean of the actual values:

CV = RMSE / ȳ

Example:
Consider the following training set for predicting the sales of the items.

Xi (Items)   Yi (Actual Sales in Thousands)
1 80
2 90
3 100
4 110
5 120

Consider two fresh items 6 and 7, whose actual values are 80 and 75 respectively. A regression model predicts the values of items 6 and 7 as 75 and 85 respectively. Find MAE, MSE, RMSE, RelMSE and CV.

Solution:

Test Item   Actual Value (yi)   Predicted Value (ŷi)
6           80                  75
7           75                  85

Mean Absolute Error (MAE)


MAE = (|80 − 75| + |75 − 85|) / 2 = (5 + 10) / 2 = 7.5

Mean Squared Error (MSE)


MSE = 1/2 × (|80 − 75|² + |75 − 85|²) = (25 + 100) / 2 = 62.5

Root Mean Square Error (RMSE)


RMSE = √MSE = √62.5 = 7.906


Relative MSE
RelMSE = [(80 − 75)² + (75 − 85)²] / [(80 − 100)² + (75 − 100)²] = 125 / 1025 = 0.1219
Here, ȳ = 100 is the mean of the actual sales values of the training set.

Coefficient of Variation
CV = RMSE / ȳ = 7.906 / 100 ≈ 0.08
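These metrics can be computed with a short NumPy sketch, using the same two test items and taking ȳ = 100 (the mean of the training sales) as in the calculations above:

```python
import numpy as np

y_true = np.array([80.0, 75.0])    # actual values of the two fresh items
y_pred = np.array([75.0, 85.0])    # values predicted by the regression model
y_train_mean = 100.0               # mean of the training sales (80..120)

residuals = y_true - y_pred
mae = np.mean(np.abs(residuals))                                           # 7.5
mse = np.mean(residuals ** 2)                                              # 62.5
rmse = np.sqrt(mse)                                                        # 7.906
rel_mse = np.sum(residuals ** 2) / np.sum((y_true - y_train_mean) ** 2)   # 0.1219
cv = rmse / y_train_mean                                                   # ~0.08
print(mae, mse, rmse, rel_mse, cv)
```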
