AIML Module-03
It can be observed that the training samples and the target function depend on the given problem, while the learning algorithm and the hypothesis set are independent of it. Thus, a learning model is, informally, the hypothesis set together with the learning algorithm, and can be stated as follows:
Learning Model = Hypothesis Set + Learning Algorithm
Let us assume a problem of predicting a label for a given input. Let D be the input dataset with both positive and negative examples, and let y be the output class.
The simple learning model can be given as:
∑ wi xi > Threshold, belongs to class 1, and
∑ wi xi < Threshold, belongs to the other class.
This can be put into a single equation as follows:
h(x) = sign((∑ wi xi) + b)
where x1, x2, ..., xd are the components of the input vector, w1, w2, ..., wd are the weights, and +1 and -1 represent the two classes. This simple model is called the perceptron model. One can simplify this by making w0 = b and fixing an extra input component x0 = 1; then the model can further be simplified as:
h(x) = sign(wT x)
This is called the perceptron learning algorithm.
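The hypothesis h(x) = sign(wT x) above can be sketched in a few lines of Python. This is a minimal illustration; the weight and input values are invented for the example, not taken from the text.

```python
# Sketch of the perceptron hypothesis h(x) = sign(w^T x).
# w[0] plays the role of the bias b, paired with a fixed input x[0] = 1.

def sign(v):
    return 1 if v >= 0 else -1

def h(x, w):
    # x and w include the bias term: x[0] is fixed to 1, w[0] = b.
    return sign(sum(wi * xi for wi, xi in zip(w, x)))

w = [-0.5, 0.4, 0.3]   # [b, w1, w2] -- illustrative values, not learned
x = [1.0, 0.9, 0.2]    # [x0 = 1, x1, x2]
print(h(x, w))         # weighted sum = -0.5 + 0.36 + 0.06 = -0.08, so -1
```

Training would then adjust w whenever h(x) disagrees with the true label; here only the prediction step is shown.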
Learning Types
There are different types of learning. Some of the different learning methods are as follows:
1. Learn by memorization or learn by repetition also called as rote learning is done by
memorizing without understanding the logic or concept. Although rote learning is basically
learning by repetition, in machine learning perspective, the learning occurs by simply comparing
with the existing knowledge for the same input data and producing the output if present.
2. Learn by examples also called as learn by experience or previous knowledge acquired at some
time, is like finding an analogy, which means performing inductive learning from observations
that formulate a general concept. Here, the learner learns by inferring a general rule from the set
of observations or examples. Therefore, inductive learning is also called as discovery learning.
3. Learn by being taught by an expert or a teacher, generally called as passive learning. However,
there is a special kind of learning called active learning where the learner can interactively query
a teacher/expert to label unlabelled data instances with the desired outputs.
4. Learning by critical thinking, also called as deductive learning, deduces new facts or
conclusion from related known facts and information.
5. Self-learning, also called as reinforcement learning, is a self-directed learning that normally
learns from mistakes, punishments, and rewards.
6. Learning to solve problems is a type of cognitive learning where learning happens in the mind
and is possible by devising a methodology to achieve a goal. The learning happens either
directly from the initial state by following the steps to achieve the goal or indirectly by inferring
the behaviour.
7. Learning by generalizing explanations, also called as explanation-based learning (EBL), is
another learning method that exploits domain knowledge from experts to improve the accuracy
of learned concepts by supervised learning.
1. Training Experience
Let us consider the design of a chess game.
In direct experience, individual board states and the correct moves of the chess game are given
directly.
In indirect experience, only the move sequences and the final results are given.
The training experience also depends on the presence of a supervisor who can label all valid
moves for a board state.
In the absence of a supervisor, the game agent plays against itself and learns the good moves.
If the training samples and testing samples have the same distribution, the results would be
good.
The target function can be approximated as a linear combination of board features:
V̂(b) = w0 + w1x1 + w2x2 + w3x3
where x1, x2, and x3 represent different board features and w0, w1, w2, and w3 represent the weights.
The training error over the samples is:
E = ∑ [Vtrain(b) − V̂(b)]²
Here, b is a sample board state, Vtrain(b) is its training value and V̂(b) is the value predicted by the hypothesis. The approximation is carried out by computing the error as the difference between the trained and the predicted hypothesis values. Let this error be error(b).
Then, for every board feature xi, the weights are updated as:
wi = wi + μ * error(b) * xi
Here, μ is a constant that moderates the size of the weight update.
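The weight-update rule above can be sketched as follows. The feature values and training value below are illustrative stand-ins, not derived from a real board position.

```python
# Sketch of the LMS weight update for a linear board-evaluation function
# V_hat(b) = w0 + w1*x1 + w2*x2 + w3*x3.

def v_hat(features, w):
    # w[0] is the bias w0; the remaining weights pair with the features.
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], features))

def lms_update(features, v_train, w, mu=0.1):
    error = v_train - v_hat(features, w)   # error(b)
    w[0] += mu * error                     # x0 is implicitly 1
    for i, xi in enumerate(features, start=1):
        w[i] += mu * error * xi            # wi = wi + mu * error(b) * xi
    return w

w = [0.0, 0.0, 0.0, 0.0]
features = [3, 0, 1]   # illustrative feature counts for one board state
w = lms_update(features, v_train=100, w=w)
print(w)               # each weight moves in the direction that reduces the error
```

Repeating this update over many (board, value) samples drives V̂ toward Vtrain.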
Thus, the learning system has the following components:
i. A Performance system to allow the game to play against itself.
ii. A Critic system to generate the samples.
iii. A Generalizer system to generate a hypothesis based on samples.
iv. An Experimenter system to generate a new problem based on the currently learnt function.
This is sent as input to the Performance system.
Formally, Concept learning is defined as "Given a set of hypotheses, the learner searches through
the hypothesis space to identify the best hypothesis that matches the target concept".
Consider the following set of training instances shown in Table 3.1.
Table 3.1: Enjoy Sport dataset
Sl. No Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes
The task is to predict the best hypothesis of the target concept. The most general hypothesis can
allow any value for each of the attribute.
It is represented as <?, ?, ?, ?, ?, ?>. This hypothesis indicates that on any day a person can enjoy
sports.
The most specific hypothesis will not allow any value for any attribute: < ø, ø, ø, ø, ø, ø >.
This hypothesis indicates that on no day can a person enjoy sports.
Concept learning is also called as Inductive Learning that tries to induce a general function from
specific training instance.
Hypothesis Space
Hypothesis space is the set of all possible hypotheses that approximates the target function f. In
other words, the set of all possible approximations of the target function can be defined as
hypothesis space. From this set of hypotheses in the hypothesis space, a machine learning
algorithm would determine the best possible hypothesis that would best describe the target
function or best fit the outputs.
Generally, the richer the hypothesis representation language, the larger the hypothesis space it can represent.
Every machine learning algorithm represents the hypothesis space in a different manner, reflecting its
assumptions about the function that maps the input variables to the output variables.
The set of hypotheses that can be generated by a learning algorithm can be further reduced by
specifying a language bias.
The subset of hypothesis space that is consistent with all-observed training instances is called as
Version Space. Version space represents the only hypotheses that are used for the classification.
For example, each of the attribute given in the Table 3.1 (Enjoy sport table) has the following
possible set of values.
Sky= Sunny, Rainy
Air Temp = Warm, Cold
Humidity = Normal, High
Wind= Strong
Water =Warm, Cool
Forecast= Same, Change
Considering these values for each of the attributes, there are (2*2*2*1*2*2) = 32 distinct
instances, which cover the 4 instances in the training dataset.
Including two more values [?, ø] for each attribute, we can generate (4*4*4*3*4*4) = 3072 distinct
hypotheses. However, any hypothesis containing one or more ø symbols represents the empty set of
instances; that is, it classifies every instance as negative.
Therefore, there will be 1 + (3*3*3*2*3*3) = 487 semantically distinct hypotheses, obtained by
including only '?' as the extra value for each attribute plus one hypothesis representing the empty set of instances.
Thus, the hypothesis space is much larger and hence we need efficient learning algorithms to
search for the best hypothesis from the set of hypotheses.
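The instance and hypothesis counts above can be checked with a few lines of Python:

```python
# Checking the instance- and hypothesis-space counts for the Enjoy Sport
# attributes (Sky, AirTemp, Humidity, Wind, Water, Forecast).
from math import prod

n_values = [2, 2, 2, 1, 2, 2]   # distinct values per attribute

distinct_instances = prod(n_values)                      # one value per attribute
syntactic_hypotheses = prod(n + 2 for n in n_values)     # add '?' and 'phi'
semantic_hypotheses = 1 + prod(n + 1 for n in n_values)  # '?' only, plus one empty-set hypothesis

print(distinct_instances)    # 32
print(syntactic_hypotheses)  # 3072
print(semantic_hypotheses)   # 487
```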
Hypothesis ordering is also important wherein the hypotheses are ordered from the most specific
one to the most general one in order to restrict searching the hypothesis space exhaustively.
Generalization - Specific to General Learning:
h = < ø, ø, ø, ø, ø, ø, ø, ø >
Read the first instance I1 and generalize the hypothesis h so that this positive instance is classified
correctly by the resulting hypothesis h1.
I1: <No, Short, Yes, No, No, Black, No, Big> → Yes [positive]
h1: <No, Short, Yes, No, No, Black, No, Big>
The second instance I2 is negative, so it is ignored and h2 = h1.
The third instance I3 is a positive instance, so generalize h2 to h3 to accommodate it. The resulting h3
is more general:
I3: <No, Short, Yes, No, No, Black, No, Medium> → Yes [positive]
h3: <No, Short, Yes, No, No, Black, No, ? >
Specialization- General to specific Learning: The learning methodology will search through the
hypothesis space for an approximate hypothesis by specializing the most general hypothesis.
Example: Consider the training instances shown in Table 3.2 and illustrate General to Specific
Learning.
Sl. No Horns Tail Tusks Paws Fur Color Hooves Size Elephant
1 No Short Yes No No Black No Big Yes
2 Yes Short No No No Brown Yes Medium No
3 No Short Yes No No Black No Medium Yes
Solution: Start from the most general hypothesis, which classifies all instances, positive and
negative, as true.
Initially,
h = <?, ?, ?, ?, ?, ?, ?, ?>
h is general enough to classify all instances as true.
Find-S algorithm is guaranteed to converge to the most specific hypothesis in H that is consistent with
the positive instances in the training dataset. Obviously, it will also be consistent with the negative
instances.
Thus, this algorithm considers only the positive instances and eliminates negative instances while
generating the hypothesis. It initially starts with the most specific hypothesis.
Algorithm: Find-S
Input: Positive instances in the Training dataset
Output: Hypothesis 'h'
Step 1: Initialize 'h' to the most specific hypothesis.
h = < ø, ø, ø, ..., ø >
Step 2: Generalize the initial hypothesis for the first positive instance (since 'h' is most specific).
Step 3: For each subsequent instance:
If it is a positive instance,
Check each attribute value in the instance against the hypothesis 'h'.
If the attribute value is the same as the hypothesis value, then do nothing;
Else, if the attribute value is different from the hypothesis value, change it to '?' in
'h'.
Else, if it is a negative instance, ignore it.
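The Find-S steps above can be sketched directly in Python. The sketch assumes nominal attributes represented as tuples, with '?' standing for "any value"; the data is the Enjoy Sport training set used in the worked example that follows.

```python
# A direct sketch of the Find-S algorithm: start from the most specific
# hypothesis, generalize on positives, ignore negatives.

def find_s(instances):
    # instances: list of (attribute_tuple, label) with label True/False
    h = None                                  # 'phi' everywhere, conceptually
    for x, label in instances:
        if not label:
            continue                          # negative instances are ignored
        if h is None:
            h = list(x)                       # first positive: copy it
        else:
            # keep matching values, replace mismatches with '?'
            h = [hi if hi == xi else '?' for hi, xi in zip(h, x)]
    return tuple(h) if h else None

# Enjoy Sport training data from the worked example:
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
print(find_s(data))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```

The output matches the maximally specific hypothesis derived step by step in the example below.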
Example: Apply the Find-S algorithm to the Enjoy Sport dataset (Table 3.1) and find the maximally specific hypothesis.
Step 1:
h0 = < ø, ø, ø, ø, ø, ø >
Step 2: Iteration-1
Given instance X1 = <Sunny, Warm, Normal, Strong, Warm, Same> →Yes [positive]
h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
Step 2: Iteration-2
Given instance X2 = <Sunny, Warm, High, Strong, Warm, Same> →Yes [positive]
h2 = <Sunny, Warm, ?, Strong, Warm, Same>
Step 2: Iteration-3
Given instance X3 = <Rainy, Cold, High, Strong, Warm, Change> →No [negative]
Since X3 is Negative this example is ignored, therefore
h3 = <Sunny, Warm, ?, Strong, Warm, Same>
Step 2: Iteration-4
Given instance X4 = <Sunny, Warm, High, Strong, Cool, Change> →Yes [positive]
h4 = <Sunny, Warm, ?, Strong, ?, ?>
Step 3:
The final maximally specific hypothesis is <Sunny, Warm, ?, Strong, ?, ?>
Example 2:
Step 1:
h0 = (ø, ø, ø, ø, ø)
Step 2: Iteration 1
Given instance X1 = < some, small, no, expensive, many > →No [negative]
Since X1 is Negative this example is ignored, therefore
h1 = < ø, ø, ø, ø, ø >
Step 2: Iteration 2
Given instance X2 = < many, big, no, expensive, one > →Yes [positive]
h2 = < many, big, no, expensive, one >
Step 2: Iteration 3
X3 = < some, big, always, expensive, few > →No [negative]
Since X3 is negative, this example is ignored, therefore
h3 = < many, big, no, expensive, one >
Step 2: Iteration 4
X4 = <many, medium, no, expensive, many> →Yes [positive]
h4 = <many, ?, no, expensive, ?>
Step 2: Iteration 5
X5 = <many, small, no, affordable, many> →Yes [positive]
h5 = < many, ?, no, ?, ? >
Step 3:
The final maximally specific hypothesis is: h5 = < many, ?, no, ?, ? >
Example 3:
Sl. No Horns Tail Tusks Paws Fur Color Hooves Size Elephant
1 No Short Yes No No Black No Big Yes
2 Yes Short No No No Brown Yes Medium No
3 No Short Yes No No Black No Medium Yes
4 No Long No Yes Yes White No Medium No
5 No Short Yes Yes Yes Black No Big Yes
Step 1:
h0 = < ø, ø, ø, ø, ø, ø, ø, ø >
Step 2: Iteration 1
Given instance X1 = < No, Short, Yes, No, No, Black, No, Big > →Yes [positive]
h1 = < No, Short, Yes, No, No, Black, No, Big>
Step 2: Iteration 2
Given instance X2 = < Yes, Short, No, No, No, Brown, Yes, Medium> →No [negative]
Since X2 is Negative this example is ignored, therefore
h2 = < No, Short, Yes, No, No, Black, No, Big>
Step 2: Iteration 3
X3 = < No, Short, Yes, No, No, Black, No, Medium> →Yes [positive]
h3 = < No, Short, Yes, No, No, Black, No, ? >
Step 2: Iteration 4
X4 = < No, Long, No, Yes, Yes, White, No, Medium> →No [negative]
Since X4 is Negative this example is ignored, therefore
h4 = < No, Short, Yes, No, No, Black, No, ? >
Step 2: Iteration 5
X5 = <No, Short, Yes, Yes, Yes, Black, No, Big> →Yes [positive]
h5 = < No, Short, Yes, ?, ?, Black, No, ? >
Step 3:
Final Maximally Specific Hypothesis is: h5 = < No, Short, Yes, ?, ?, Black, No, ? >
3. Many times, the training dataset may contain errors; such inconsistent data instances can
mislead this algorithm in determining a consistent hypothesis, since it ignores negative instances.
Hence, it is necessary to find the set of hypotheses that are consistent with the training data including the
negative examples. To overcome the limitations of Find-S algorithm, Candidate Elimination algorithm
was proposed to output the set of all hypotheses consistent with the training dataset.
Version Spaces
The version space contains the subset of hypotheses from the hypothesis space that is consistent with all
training instances in the training dataset.
List-Then-Eliminate Algorithm
The principal idea of this learning algorithm is to initialize the version space to contain all
hypotheses and then eliminate any hypothesis that is found inconsistent with any training instance.
Initially, the algorithm starts with a version space to contain all hypotheses scanning each training
instance.
The hypotheses that are inconsistent with the training instance are eliminated.
Finally, the algorithm outputs the list of remaining hypotheses that are all consistent.
Algorithm: List-Then-Eliminate
Input: Version Space - a list of all hypotheses
Output: Set of consistent hypotheses
Step 1: Initialize the version space with a list of hypotheses.
Step 2: For each training instance,
remove from version space any hypothesis that is inconsistent.
This algorithm works fine if the hypothesis space is finite but practically it is difficult to deploy this
algorithm. Hence, a variation of this idea is introduced in the Candidate Elimination algorithm.
The algorithm defines two boundaries called general boundary which is a set of all hypotheses that
are the most general and specific boundary which is a set of all hypotheses that are the most
specific.
Thus, the algorithm represents the version space compactly through these two boundary sets: every
hypothesis that lies between the most specific and the most general boundaries belongs to the version
space. This provides a compact representation of the List-Then-Eliminate algorithm.
Generating Positive Hypothesis 'S' : If it is a positive example, refine S to include the positive
instance. We need to generalize S to include the positive instance. The hypothesis is the conjunction of
'S' and positive instance. When generalizing, for the first positive instance, add to S all minimal
generalizations such that S is filled with attribute values of the positive instance. For the subsequent
positive instances scanned, check the attribute value of the positive instance and S obtained in the
previous iteration. If the attribute values of positive instance and S are different, fill that field value with
a '?'. If the attribute values of positive instance and S are same, no change is required.
If it is a negative instance, it is skipped and 'S' is left unchanged.
Generating Negative Hypothesis 'G': If it is a negative instance, refine G to exclude the negative
instance. Then, prune G to exclude all inconsistent hypotheses in G with the positive instance. The idea
is to add to G all minimal specializations to exclude the negative instance and be consistent with the
positive instance. Negative hypothesis indicates general hypothesis. If the attribute values of positive
and negative instances are different, then fill that field with positive instance value so that the hypothesis
does not classify that negative instance as true. If the attribute values of the positive and negative
instances are the same, then there is no need to update 'G'; fill that attribute value with a '?'.
Generating Version Space - [Consistent Hypothesis]: We need to take the combination of sets in 'G'
and check that with 'S'. When the combined set fields are matched with fields in 'S', then only that is
included in the version space as consistent hypothesis.
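The boundary updates described above can be sketched as follows. This is a simplified illustration for conjunctive hypotheses over nominal attributes: as the text describes, a general hypothesis is specialized by filling a '?' with the value retained in S (the positive instances' value). It assumes some positive instance informs S before specialization is needed, and it is not a complete Candidate Elimination implementation.

```python
# Simplified sketch of Candidate Elimination boundary updates.
# '?' means "any value"; S starts conceptually as all-'phi' (None here).

def matches(h, x):
    # h covers x if every attribute is '?' or equals x's value
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def candidate_elimination(instances, n_attrs):
    S = None                              # most specific boundary
    G = [tuple(['?'] * n_attrs)]          # most general boundary
    for x, label in instances:
        if label:                         # positive: generalize S, prune G
            S = list(x) if S is None else [s if s == v else '?'
                                           for s, v in zip(S, x)]
            G = [g for g in G if matches(g, x)]
        else:                             # negative: specialize G
            new_G = []
            for g in G:
                if not matches(g, x):
                    new_G.append(g)       # already excludes the negative
                    continue
                for i in range(n_attrs):
                    # minimal specialization: fix one '?' to S's value,
                    # as described in the text above
                    if g[i] == '?' and S is not None \
                            and S[i] != '?' and S[i] != x[i]:
                        new_G.append(tuple(S[i] if j == i else g[j]
                                           for j in range(n_attrs)))
            G = new_G
    return tuple(S), G

# Enjoy Sport training data (Table 3.1):
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
S, G = candidate_elimination(data, 6)
print(S)   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print(G)   # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]
```

The result reproduces the S4 and G4 boundaries of the first worked example below.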
EXAMPLES
Solution:
The third example is negative. The hypothesis at the specific boundary is consistent with it, hence we
retain it; the hypothesis at the generic boundary is inconsistent, hence we write all consistent
hypotheses by removing one “?” at a time.
S3: (Sunny,Warm, ?, Strong, Warm, Same)
G3: (Sunny,?,?,?,?,?) (?,Warm,?,?,?,?) (?,?,?,?,?,Same)
The fourth example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary are retained.
S4: (Sunny, Warm, ?, Strong, ?, ?)
G4: (Sunny,?,?,?,?,?) (?,Warm,?,?,?,?)
Learned Version Space by Candidate Elimination Algorithm for the given dataset (all hypotheses between S4 and G4) is:
(Sunny, Warm, ?, Strong, ?, ?) (Sunny, ?, ?, Strong, ?, ?) (Sunny, Warm, ?, ?, ?, ?)
(?, Warm, ?, Strong, ?, ?) (Sunny, ?, ?, ?, ?, ?) (?, Warm, ?, ?, ?, ?)
Example 2:
Example Size Color Shape Class/Label
G2: (Small, Blue, ?), (Small, ?, Circle), (?, Blue, ?), (Big, ?, Triangle), (?, Blue, Triangle)
The third example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
S3: (Small, Red, Circle)
G3: (Small, ?, Circle)
The fourth example is negative, the hypothesis at the specific boundary is consistent, hence we retain it,
and the hypothesis at the generic boundary is inconsistent hence we write all consistent hypotheses by
removing one “?” at a time.
S4: (Small, Red, Circle)
G4: (Small, ?, Circle)
The fifth example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
S5: (Small, ?, Circle)
G5: (Small, ?, Circle)
Learned Version Space by Candidate Elimination Algorithm for the given dataset is:
S: (Small, ?, Circle) G: (Small, ?, Circle)
Since S and G coincide, the version space contains the single hypothesis (Small, ?, Circle).
Example 3:
Solution:
S0: (ø, ø, ø, ø, ø) Most Specific Boundary
G0: (?, ?, ?, ?, ?) Most Generic Boundary
The first example is negative, the hypothesis at the specific boundary is consistent, hence we retain it,
and the hypothesis at the generic boundary is inconsistent hence we write all consistent hypotheses by
removing one “?” at a time.
S1: (ø, ø, ø, ø, ø)
G1: (Many,?,?,?, ?) (?, Big,?,?,?) (?,Medium,?,?,?) (?,?,?,Exp,?) (?,?,?,?,One) (?,?,?,?,Few)
The second example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
S2: (Many, Big, No, Exp, Many)
G2: (Many,?,?,?, ?) (?, Big,?,?,?) (?,?,?,Exp,?) (?,?,?,?,Many)
The third example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
S3: (Many, ?, No, Exp, ?)
G3: (Many,?,?,?,?) (?,?,?,exp,?)
The fourth example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
S4: (Many, ?, No, ?, ?)
G4: (Many,?,?,?,?)
Learned Version Space by Candidate Elimination Algorithm for given data set is:
(Many, ?, No, ?, ?) (Many, ?, ?, ?, ?)
Example 4:
Solution:
S0: (ø, ø, ø, ø, ø) Most Specific Boundary
G0: (?, ?, ?, ?, ?) Most Generic Boundary
The first example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary
S1: (Circular, Large, Light, Smooth, Thick)
G1: (?, ?, ?, ?, ?)
The second example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary.
G2: (?, ?, ?, ?, ?)
S2: (Circular, Large, Light, ?, Thick)
The third example is negative, the hypothesis at the specific boundary is consistent, hence we retain it,
and the hypothesis at the generic boundary is inconsistent hence we write all consistent hypotheses by
removing one “?” at a time.
G3: (Circular, ?, ?, ?, ?) (?, ?, Light, ?, ?) (?, ?, ?, ?, Thick)
S3: (Circular, Large, Light, ?, Thick)
The fourth example is positive, the hypothesis at the specific boundary is inconsistent, hence we extend
the specific boundary, and the consistent hypothesis at the generic boundary is retained and inconsistent
hypotheses are removed from the generic boundary
S4: (?, Large, Light, ?, Thick)
G4: (?, ?, Light, ?, ?) (?, ?, ?, ?, Thick)
The fifth example is negative, the hypothesis at the specific boundary is consistent, hence we retain it,
and the hypothesis at the generic boundary is inconsistent hence we write all consistent hypotheses by
removing one “?” at a time.
S5: (?, ?, Light, ?, Thick)
G5: (?, ?, Light, ?, ?) (?, ?, ?, ?, Thick)
Learned Version Space by Candidate Elimination Algorithm for given data set is:
S5: (?, ?, Light, ?, Thick)
G5: (?, ?, Light, ?, ?) (?, ?, ?, ?, Thick)
NEAREST-NEIGHBOR LEARNING
A natural approach to similarity-based classification is k-Nearest-Neighbors (k-NN), which is a
non-parametric method used for both classification and regression problems.
It is a simple and powerful non-parametric algorithm that predicts the category of the test instance
according to the 'k' training samples closest to the test instance, and assigns it to the category
that has the largest probability, i.e., the majority class among those neighbours.
A visual representation of this learning is shown in Figure 4.1.
There are two classes of objects, called C1 and C2, in the given figure. When given a test instance T,
the category of this test instance is determined by looking at the classes of its k = 3 nearest
neighbours. Thus, the class of the test instance T is predicted as the majority class among those three
neighbours.
Algorithm: k-NN
Inputs: Training dataset T, distance metric d, test instance t, number of nearest neighbours k
Output: Predicted class or category
Prediction: For test instance t,
Step 1: Compute the distance between the test instance t and every instance i in the training
dataset T using a distance metric (Euclidean distance).
[Continuous attributes - the Euclidean distance between two points in the plane with coordinates (x1, y1)
and (x2, y2) is given as dist((x1, y1), (x2, y2)) = √((x1 − x2)² + (y1 − y2)²).]
[Categorical attributes (Binary) - Hamming Distance: If the value of the two instances is same, the
distance d will be equal to 0 otherwise d = 1.]
Step 2: Sort the distances in an ascending order and select the first k nearest training data instances to
the test instance.
Step 3: Predict the class of the test instance by majority voting (if target attribute is discrete valued)
or mean (if target attribute is continuous valued) of the k selected nearest instances.
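The three steps above can be sketched compactly in Python, assuming a small two-feature dataset of Red/Blue points (the same data as the first worked example below) and majority voting for a discrete target.

```python
# Compact sketch of k-NN prediction: compute distances, take the k
# nearest training instances, and vote on their labels.
from collections import Counter
from math import dist          # Euclidean distance (Python 3.8+)

def knn_predict(train, test, k=3):
    # train: list of (feature_tuple, label); test: feature_tuple
    neighbours = sorted(train, key=lambda row: dist(row[0], test))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((40, 20), 'Red'), ((50, 50), 'Blue'), ((60, 90), 'Blue'),
         ((10, 25), 'Red'), ((70, 70), 'Blue'), ((60, 10), 'Red'),
         ((25, 80), 'Blue')]
print(knn_predict(train, (20, 35), k=3))   # Red
```

For a continuous target, Step 3 would average the neighbours' values instead of voting.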
Example: Consider the training data below with features X and Y and a class label (Red/Blue).
Classify the test instance (20, 35) using k-NN with k = 3.
Step 1: Compute the Euclidean distance of each training instance from the test instance (20, 35).
X Y Class Euclidean distance
40 20 Red √((40 − 20)² + (20 − 35)²) = 25
50 50 Blue √((50 − 20)² + (50 − 35)²) = 33.54
60 90 Blue √((60 − 20)² + (90 − 35)²) = 68.01
10 25 Red √((10 − 20)² + (25 − 35)²) = 14.14
70 70 Blue √((70 − 20)² + (70 − 35)²) = 61.03
60 10 Red √((60 − 20)² + (10 − 35)²) = 47.17
25 80 Blue √((25 − 20)² + (80 − 35)²) = 45.28
Step 2: The three nearest neighbours are (10, 25) Red, (40, 20) Red and (50, 50) Blue.
Step 3: Since Red has the majority vote, the given test instance (20, 35) is classified as Red.
Example 2: Consider the training dataset below. Use k-NN (k = 3) to determine the result class of the new test instance (6.1, 40, 5).
Sl. No CGPA Assessment Project Submitted Result
1 9.2 85 8 Pass
2 8 80 7 Pass
3 8.5 81 8 Pass
4 6 45 5 Fail
5 6.5 50 4 Fail
6 8.2 72 7 Pass
7 5.8 38 5 Fail
8 8.9 91 9 Pass
Step 1: Compute the Euclidean distance of each training instance from the test instance (6.1, 40, 5).
Sl. No CGPA Assessment Project Submitted Result Euclidean distance
1 9.2 85 8 Pass √((9.2 − 6.1)² + (85 − 40)² + (8 − 5)²) = 45.206
2 8 80 7 Pass √((8 − 6.1)² + (80 − 40)² + (7 − 5)²) = 40.095
3 8.5 81 8 Pass √((8.5 − 6.1)² + (81 − 40)² + (8 − 5)²) = 41.179
4 6 45 5 Fail √((6 − 6.1)² + (45 − 40)² + (5 − 5)²) = 5.001
5 6.5 50 4 Fail √((6.5 − 6.1)² + (50 − 40)² + (4 − 5)²) = 10.058
6 8.2 72 7 Pass √((8.2 − 6.1)² + (72 − 40)² + (7 − 5)²) = 32.131
7 5.8 38 5 Fail √((5.8 − 6.1)² + (38 − 40)² + (5 − 5)²) = 2.022
8 8.9 91 9 Pass √((8.9 − 6.1)² + (91 − 40)² + (9 − 5)²) = 51.233
Step 2: The three nearest neighbours (k = 3) are instances 7, 4 and 5, all with result Fail.
Step 3: Since Fail has the majority vote, the given test instance (6.1, 40, 5) is classified
as Fail.
Algorithm: Weighted k-NN
Inputs: Training dataset T, distance metric d, weighting function w(i), test instance t, number of
nearest neighbours k
Output: Predicted class or category
Prediction: For test instance t,
Step 1: Compute the distance between the test instance t and every instance i in the training dataset T
using a distance metric (Euclidean distance).
Step 2: Sort the distances in ascending order and select the first k nearest training data instances to
the test instance.
Step 3: Predict the class of the test instance by the weighted voting technique (weighting function w(i))
over the k selected nearest instances:
Compute the inverse of each distance of the k selected nearest instances and use it as the weight
of that instance's vote.
Example:
Consider the same training dataset given below Table. Use Weighted k-NN and determine the class of
the new test data (7.6, 60, 8).
Step 1: Compute the Euclidean distance of each training instance from the test instance (7.6, 60, 8).
Sl. No CGPA Assessment Project Submitted Result Euclidean distance
1 9.2 85 8 Pass √((9.2 − 7.6)² + (85 − 60)² + (8 − 8)²) = 25.051
2 8 80 7 Pass √((8 − 7.6)² + (80 − 60)² + (7 − 8)²) = 20.028
3 8.5 81 8 Pass √((8.5 − 7.6)² + (81 − 60)² + (8 − 8)²) = 21.019
4 6 45 5 Fail √((6 − 7.6)² + (45 − 60)² + (5 − 8)²) = 15.380
5 6.5 50 4 Fail √((6.5 − 7.6)² + (50 − 60)² + (4 − 8)²) = 10.826
6 8.2 72 7 Pass √((8.2 − 7.6)² + (72 − 60)² + (7 − 8)²) = 12.056
7 5.8 38 5 Fail √((5.8 − 7.6)² + (38 − 60)² + (5 − 8)²) = 22.276
8 8.9 91 9 Pass √((8.9 − 7.6)² + (91 − 60)² + (9 − 8)²) = 31.043
Step 2: Sort the distances in ascending order and select the first 3 nearest training data
instances to the test instance. The selected nearest neighbours are:
Sl. No CGPA Assessment Project Submitted Result Euclidean distance
5 6.5 50 4 Fail 10.826
6 8.2 72 7 Pass 12.056
4 6 45 5 Fail 15.380
Step 3: Predict the class of the test instance by the weighted voting technique over the 3 selected
nearest instances.
Compute the inverse of each distance of the 3 selected nearest instances:
Sl. No CGPA Assessment Project Submitted Result Euclidean distance (ED) Inverse distance (1/ED)
5 6.5 50 4 Fail 10.826 0.0924
6 8.2 72 7 Pass 12.056 0.0829
4 6 45 5 Fail 15.380 0.0650
The total weight for Fail is 0.0924 + 0.0650 = 0.1574 and for Pass is 0.0829. Since Fail has the larger
weighted vote, the test instance (7.6, 60, 8) is classified as Fail.
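The weighted voting above can be sketched in Python, reusing the CGPA dataset and the test instance (7.6, 60, 8):

```python
# Sketch of weighted k-NN: each of the k nearest neighbours votes with
# weight 1/distance, so closer neighbours count for more.
from collections import defaultdict
from math import dist

def weighted_knn(train, test, k=3):
    neighbours = sorted(train, key=lambda r: dist(r[0], test))[:k]
    votes = defaultdict(float)
    for features, label in neighbours:
        votes[label] += 1.0 / dist(features, test)   # w(i) = 1 / distance
    return max(votes, key=votes.get)

train = [((9.2, 85, 8), 'Pass'), ((8, 80, 7), 'Pass'), ((8.5, 81, 8), 'Pass'),
         ((6, 45, 5), 'Fail'), ((6.5, 50, 4), 'Fail'), ((8.2, 72, 7), 'Pass'),
         ((5.8, 38, 5), 'Fail'), ((8.9, 91, 9), 'Pass')]
print(weighted_knn(train, (7.6, 60, 8), k=3))   # Fail
```

A zero distance (test point equal to a training point) would need special handling; this sketch assumes distinct points.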
Example: Consider the sample data shown in below Table with two features x and y. The target classes
are 'A' or 'B'. Predict the class of the instance (6, 5) using Nearest Centroid Classifier.
X Y Class
3 1 A
5 2 A
4 3 A
7 6 B
6 7 B
8 5 B
Step 1: Compute the mean/centroid of each class. In this example there are two classes, 'A' and 'B'.
Centroid of class 'A' = ((3 + 5 + 4), (1 + 2 + 3))/3 = (12, 6)/3 = (4, 2)
Centroid of class 'B' = ((7 + 6 + 8), (6 + 7 + 5))/3 = (21, 18)/3 = (7, 6)
Step 2: Calculate the Euclidean distance between the test instance (6, 5) and each of the centroids.
Euc_distance from the centroid of class A = √((6 − 4)² + (5 − 2)²) = √13 = 3.6
Euc_distance from the centroid of class B = √((6 − 7)² + (5 − 6)²) = √2 = 1.414
Step 3: The test instance is closer to the centroid of class B (1.414 < 3.6), hence the test instance
(6, 5) is classified as class B.
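The nearest-centroid steps above can be sketched as a short Python function using the same data:

```python
# Nearest Centroid Classifier sketch: compute the per-class centroid,
# then assign the test point to the class with the closest centroid.
from math import dist

def nearest_centroid(train, test):
    by_class = {}
    for point, label in train:
        by_class.setdefault(label, []).append(point)
    # centroid = coordinate-wise mean of each class's points
    centroids = {label: tuple(sum(c) / len(pts) for c in zip(*pts))
                 for label, pts in by_class.items()}
    return min(centroids, key=lambda label: dist(centroids[label], test))

train = [((3, 1), 'A'), ((5, 2), 'A'), ((4, 3), 'A'),
         ((7, 6), 'B'), ((6, 7), 'B'), ((8, 5), 'B')]
print(nearest_centroid(train, (6, 5)))   # B
```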
For ordinary linear regression, the cost function is:
J(β) = ∑i=1..m (βT xi − yi)²
where 'm' is the number of instances in the training dataset.
Now the cost function is modified for locally weighted linear regression by including weights only
for the nearest-neighbour points. Hence, the cost function is given as:
J(β) = ∑i=1..m wi (βT xi − yi)²
where wi is the weight associated with each xi.
The weight function used is a Gaussian kernel that gives a higher value for instances that are close
to the test instance; for instances far away, it tends to zero but never equals zero.
wi is computed as:
wi = exp(−(xi − x)² / (2τ²))
where τ is called the bandwidth parameter and controls the rate at which wi falls off with the
distance from the test instance x.
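The Gaussian kernel weight can be illustrated numerically; the bandwidth τ (tau) and the sample inputs below are arbitrary choices for the illustration:

```python
# Gaussian kernel weight w_i = exp(-(x_i - x)^2 / (2 * tau^2)) used by
# locally weighted regression: points near the query x get weight near 1,
# distant points get weight near 0 (but never exactly 0).
from math import exp

def lwr_weight(x_i, x, tau=1.0):
    return exp(-((x_i - x) ** 2) / (2 * tau ** 2))

x = 5.0
for x_i in [5.0, 6.0, 8.0]:
    print(round(lwr_weight(x_i, x), 4))   # 1.0, 0.6065, 0.0111
```

A larger τ flattens the kernel, so more distant points influence the local fit.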
REGRESSION ANALYSIS
Regression analysis is a supervised learning method for predicting continuous variables. The difference
between classification and regression analysis is that regression methods predict quantitative variables
or continuous numbers, unlike classification methods, which predict categorical variables or labels.
Regression is used to model linear or non-linear relationships among the variables of the given dataset.
INTRODUCTION TO REGRESSION
Regression analysis is the premier method of supervised learning. This is one of the most popular and
oldest supervised learning techniques.
Given a training dataset D containing N training points (xi, yi), where i = 1, ..., N, regression analysis
is used to model the relationship between one or more independent variables x and a dependent
variable y.
The relationship between the dependent and independent variables can be represented as a function
as follows:
y = f(x)
Here, the feature variable x is also known as an explanatory variable, a predictor variable, an
independent variable, a covariate, or a domain point.
y is a dependent variable. Dependent variables are also called as labels, target variables, or response
variables.
Regression analysis determines the change in the response variable when one explanatory variable is
varied while keeping all other parameters constant.
This is used to determine the relationship that each of the explanatory variables exhibits with the
response. Thus, regression analysis is used for prediction and forecasting.
Regression is used to predict continuous variables or quantitative variables such as price and revenue.
Thus, the primary concern of regression analysis is to find answer to questions such as:
What is the relationship between the variables?
What is the strength of the relationships?
What is the nature of the relationship such as linear or non-linear?
What is the relevance of the attributes?
What is the contribution of each attribute?
There are many applications of regression analysis. Some of the applications of regressions include
predicting:
Sales of goods or services
Value of bonds in portfolio management
Premiums charged by insurance companies
In positive correlation, one variable change is associated with the change in another variable.
In negative correlation, the relationship between the variables is reciprocal while in random
correlation, no relationship exists between variables.
While correlation is about relationships among variables, say x and y, regression is about predicting
one variable given another variable.
Regression and Causation
Causation is about causal relationship among variables, say x and y.
Causation means knowing whether x causes y to happen or vice versa.
x causes y is often denoted as x implies y.
Correlation and Regression relationships are not same as causation relationship.
For example, the correlation between economical background and marks scored does not imply that
economic background causes high marks.
Linearity and Non-linearity Relationships
The linearity relationship between the variables means the relationship between the dependent and
independent variables can be visualized as a straight line.
The line of the form, y = ax + b can be fitted to the data points that indicate the relationship between
x and y.
By linearity, it is meant that as one variable increases, the corresponding variable also increases in a
linear manner.
A linear relationship is shown in Figure 5.2 (a). A non-linear relationship exists in functions such as
exponential function and power function and it is shown in Figures 5.2 (b) and 5.2 (c).
Linear Regression: It is a type of regression where a line is fitted upon given data for finding the linear
relationship between one independent variable and one dependent variable to describe relationships.
Multiple Regression: It is a type of regression where a line is fitted for finding the linear relationship
between two or more independent variables and one dependent variable to describe relationships among
variables.
Logistic Regression: It is used for predicting categorical variables that involve one or more independent
variables and one dependent variable. This is also known as a binary classifier.
Lasso and Ridge Regression Methods: These are special variants of regression method where
regularization methods are used to limit the number and size of coefficients of the independent
variables.
1. Outliers - Outliers are abnormal data. They can bias the outcome of the regression model, as outliers
pull the regression line towards them.
2. Number of cases - The ratio of cases to independent variables should be at least 20:1; for
every explanatory variable, there should be at least 20 samples. At least five samples per variable are
required in extreme cases.
3. Missing data - Missing data in training data can make the model unfit for the sampled data.
4. Multicollinearity - If the explanatory variables are highly correlated, the regression is vulnerable to
bias. Singularity occurs when variables are perfectly correlated (a correlation of 1).
The idea of linear regression is based on the Ordinary Least Squares (OLS) approach, also known as the
least squares method.
In this method, the data points are modelled using a straight line. Any arbitrarily drawn line is not an
optimal line.
The vertical distance between each point and the line is called an error. These individual errors are
added to compute the total error of the predicted line. This is called sum of residuals.
The squares of the individual errors can also be computed and added to give a sum of squared error.
The line with the lowest sum of squared error is called line of best fit.
In other words, OLS is an optimization technique in which the difference between the data points and
the fitted line is minimized.
Mathematically, based on the line equation, the error over the points (x1, y1), (x2, y2), ..., (xn, yn) is:
E = ∑ (yi − ŷi)² = ∑ (yi − (a0 + a1xi))²
The sum of the squares of the individual errors is preferred because the individual errors do not cancel
out and are always positive, and the sum of squares gives a large increase even for a small change in
the error.
Therefore, this is preferred for linear regression, and linear regression is modelled as a
minimization function as follows:
J(a0, a1) = 1/2 ∑ (yi − ŷi)²
= 1/2 ∑ (yi − (a0 + a1xi))²
Here, J(a0, a1) is the criterion function of the parameters a0 and a1. It needs to be minimized; this is
done by differentiating with respect to a0 and a1 and equating the results to zero. This yields the
estimates of a0 and a1 as follows:
a1 = ((1/n) ∑ xiyi − x̄ȳ) / ((1/n) ∑ xi² − (x̄)²)
And the value of a0 is given as follows:
a0 = ȳ − a1x̄
Problems
Let us consider an example where five weeks' sales data (in thousands) is given. Apply the linear
regression technique to predict the 7th and 9th week sales.
Solution
a1 = ((1/n) ∑ xiyi − x̄ȳ) / ((1/n) ∑ xi² − (x̄)²) = 0.66
a0 = ȳ − a1x̄ = 0.54
The predicted 7th week sale would be (when x = 7), y = 0.54 + 0.66 * 7 = 5.16, and the 9th week sale
would be (when x = 9), y = 0.54 + 0.66 * 9 = 6.48.
In matrix form, the coefficient estimates are obtained as:
â = (XT X)−1 XT Y
where X = [1 x1; 1 x2; ...; 1 xn] contains a leading column of ones for the intercept, and
Y = [y1 y2 ... yn]T.
Problem
Find linear regression of the data of week and product sales (in Thousands) given. Use linear regression
in matrix form.
Xi (week) Yi (product Sales in Thousands)
1 1
2 3
3 4
4 8
Solution:
Here, the independent variable X is given as
XT = [1 2 3 4]
And the dependent variable is given as follows:
YT = [1 3 4 8]
The data can be given in matrix form (with a leading column of ones for the intercept) as follows:
X = [1 1; 1 2; 1 3; 1 4], Y = [1; 3; 4; 8]
XT X = [4 10; 10 30], (XT X)−1 = [1.5 −0.5; −0.5 0.2], XT Y = [16; 51]
â = (XT X)−1 XT Y = [−1.5; 2.2]
Hence a0 = −1.5 and a1 = 2.2, and the fitted line is y = −1.5 + 2.2x.
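The normal-equation solution for this data can be checked in Python. For a single feature, XT X is a 2×2 matrix, so its inverse can be written out by hand, which reduces to the closed-form sums below; no linear-algebra library is needed.

```python
# Solving the week/sales problem via the normal equation a = (X^T X)^(-1) X^T Y,
# expanded for the simple-linear case y = a0 + a1*x.

def ols_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx            # determinant of X^T X
    a0 = (sxx * sy - sx * sxy) / det   # intercept
    a1 = (n * sxy - sx * sy) / det     # slope
    return a0, a1

a0, a1 = ols_line([1, 2, 3, 4], [1, 3, 4, 8])
print(a0, a1)   # -1.5 2.2, i.e. y = -1.5 + 2.2x
```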
Standard error estimate is another useful measure of regression. It is the standard deviation of the
observed values about the predicted values. This is given as:
S = √( ∑ (yi − ŷi)² / (n − 2) )
Here, yi is the observed value, ŷi is the predicted value, and n is the number of samples.
RMSE = √MSE = √( (1/n) ∑ (yi − ŷi)² )
Relative MSE
Relative MSE is the ratio of the prediction error of ŷ to the error of the trivial predictor, the mean of
the observed values. A value of zero indicates that the model is perfect; its value ranges between 0
and 1 for a useful model. If the value is more than 1, then the created model is not a good one.
RelMSE = ∑ (yi − ŷi)² / ∑ (yi − ȳ)²
Coefficient of Variation
Coefficient of variation is unitless and is given as:
CV = RMSE / ȳ
Example:
Consider the following training set for predicting the sales of the items.
Consider two fresh items 6 and 7, whose actual values are 80 and 90 respectively. A regression model
predicts the values of items 6 and 7 as 75 and 82 respectively. Find MAE, MSE, RMSE, RelMSE, and
CV.
Solution:
MAE = (|80 − 75| + |90 − 82|) / 2 = (5 + 8) / 2 = 6.5
MSE = ((80 − 75)² + (90 − 82)²) / 2 = (25 + 64) / 2 = 44.5
RMSE = √44.5 = 6.67
Relative MSE
RelMSE = ((80 − 75)² + (90 − 82)²) / ∑ (yi − ȳ)² = 0.1219, where ∑ (yi − ȳ)² is computed over the
training set.
Coefficient of Variation
CV = RMSE / ȳ = 6.67 / 85 ≈ 0.08
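The error measures for the two fresh items can be computed directly. RelMSE is omitted from the sketch because its denominator requires the full training-set values, which are not reproduced here.

```python
# Error measures for actual values (80, 90) and predictions (75, 82).
from math import sqrt

actual = [80, 90]
predicted = [75, 82]
n = len(actual)

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
rmse = sqrt(mse)
y_bar = sum(actual) / n
cv = rmse / y_bar              # coefficient of variation, unitless

print(mae)             # 6.5
print(mse)             # 44.5
print(round(rmse, 2))  # 6.67
print(round(cv, 2))    # 0.08
```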