Machine Learning
Prepared by
Dr. C Sunitha Ram
Dr N Kumaran
OBJECTIVES:
1. To introduce students to the basic concepts and techniques of Machine Learning.
2. To have a thorough understanding of the Supervised and Unsupervised learning
techniques
3. To study the various probability-based learning techniques
4. To understand graphical models of machine learning algorithms
PROGRAMME OUTCOME:
1. Apply basic principles and practices of computing grounded in mathematics and
science to successfully complete software related projects to meet customer business
objective(s) and/or productively engage in research.
2. Apply their knowledge and skills to succeed in a computer science career and/or
obtain an advanced degree.
3. Demonstrate an ability to use techniques, skills, and modern computing tools to
implement and organize computing works under given constraints.
4. Demonstrate problem solving and design skills including the ability to formulate
problems and their solutions, think creatively and communicate effectively.
5. Develop software as per the appropriate software life cycle model.
6. Organize and maintain the information of an organization.
7. Exhibit teamwork, communication, and interpersonal skills which enable them to
work effectively with interdisciplinary teams.
8. Provide an excellent education experience through the incorporation of current
pedagogical techniques, understanding of contemporary trends in research and
technology, and hands-on laboratory experiences that enhance the educational
experience.
9. Demonstrate an ability to engage in life-long learning.
10. Function ethically and responsibly, and to remain informed and involved as full
participants in our profession and our society.
COURSE OUTCOMES:
Upon completion of the course, the students will be able to:
Distinguish between supervised, unsupervised and semi-supervised learning
1. Apply the apt machine learning strategy for any given problem
2. Suggest supervised, unsupervised or semi-supervised learning algorithms for any
given problem
3. Design systems that use the appropriate trees in probability models of machine
learning
4. Modify existing machine learning algorithms to improve classification efficiency
5. Design systems that use the appropriate graph models of machine learning
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10
CO1 L M M
CO2 M M H
CO3 L M M H H
CO4 L M H
CO5 L M M H H
UNIT – I INTRODUCTION
Learning – Types of Machine Learning – Supervised Learning – The Brain and the Neuron
– Design a Learning System – Perspectives and Issues in Machine Learning – Concept
Learning Task – Concept Learning as Search – Finding a Maximally Specific Hypothesis –
Version Spaces and the Candidate Elimination Algorithm – Linear Discriminants –
Perceptron – Linear Separability – Linear Regression.
TEXT BOOKS:
1. Stephen Marsland, "Machine Learning: An Algorithmic Perspective", Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014.
2. Tom M. Mitchell, "Machine Learning", First Edition, McGraw Hill Education, 2013.
REFERENCES:
1. Peter Flach, "Machine Learning: The Art and Science of Algorithms that Make Sense of Data", First Edition, Cambridge University Press, 2012.
2. Jason Bell, "Machine Learning: Hands-On for Developers and Technical Professionals", First Edition, Wiley, 2014.
3. Ethem Alpaydin, "Introduction to Machine Learning, 3e (Adaptive Computation and Machine Learning Series)", Third Edition, MIT Press, 2014.
Introduction: Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining (the process of discovering patterns in large data sets), machine learning and big data.
Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy.
What is Data?
➢ Data is often viewed as the lowest level of abstraction from which information and knowledge are derived.
➢ Data can be numbers, words, measurements, observations or even just descriptions of things. Data is also a representation of a fact, figure or idea.
➢ Data on its own carries no meaning. For data to become information, it must be interpreted and take on meaning.
A raw data table, for example, is just a collection of random information until it is interpreted.
Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering.
Key Elements of Data Science
Types of Machine Learning
C. Optimization
D. Recommender System
E. Feature Analysis
F. Sentiment Analysis
Supervised learning
Application of supervised learning
Unsupervised learning
▪ Unsupervised learning is very much the opposite of supervised learning. It features no labels. Instead, our algorithm would be fed a lot of data and given the tools to understand the properties of the data. From there, it can learn to group, cluster, and/or organize the data in a way such that a human (or other intelligent algorithm) can come in and make sense of the newly organized data.
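As a rough illustration of this idea (not from the text), the sketch below groups unlabelled 2-D points with a minimal k-means loop; all names (points, kmeans, centres) are our own.

import random

def kmeans(points, k, iterations=20):
    # Minimal k-means sketch: group unlabelled points into k clusters.
    centres = random.sample(points, k)  # pick k initial centres
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centres]
            clusters[d.index(min(d))].append((x, y))
        # Update step: move each centre to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centres[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centres, clusters

# Two obvious groups of unlabelled data; no labels are ever provided.
data = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
centres, clusters = kmeans(data, k=2)
print(centres)

The algorithm never sees a label; the grouping emerges purely from the structure of the data, which is exactly the sense of "unsupervised" described above.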
Application of unsupervised learning
o Recommender Systems: If you've ever used YouTube or Netflix, you've most likely encountered a video recommendation system. These systems are often placed in the unsupervised domain. We know things about videos, such as their length and genre, and we also know the watch history of many users. Considering users that have watched similar videos to you and then enjoyed other videos that you have yet to see, a recommender system can spot this relationship in the data and prompt you with such a suggestion.
o Buying Habits (Market Basket Analysis): Buying habits are contained in databases, and that data is being bought and sold actively. These buying habits can be used in unsupervised learning algorithms to group customers into similar purchasing segments. This helps companies market to these grouped segments and can even resemble recommender systems.
o Grouping User Logs (Semantic Clustering): Less user-facing, but still very relevant, unsupervised learning can be used to group user logs and issues. This can help companies identify central themes in the issues their customers face and rectify them, by improving a product or designing an FAQ to handle common issues.
Reinforcement Learning
• Reinforcement learning can be thought of as learning from mistakes: place a reinforcement learning algorithm into any environment and it will make a lot of mistakes in the beginning.
Application of Reinforcement learning
▪ Video Games: Google DeepMind's AlphaZero and AlphaGo, which learned to play the game of Go, are well-known applications. Our Mario example is also a common one.
▪ Industrial Simulation: For many robotic applications (think assembly lines), it is useful to have our machines learn to complete their tasks without having to hardcode their processes.
▪ Resource Management: Reinforcement learning is good for navigating complex environments and can handle the need to balance certain requirements. For example, it has been used to manage Google's data centers.
Steps to solve a Machine Learning Problem
DESIGNING A LEARNING SYSTEM
To illustrate the basic design issues and approaches to machine learning, let us consider designing a program to learn to play checkers, with the goal of entering it in the world checkers tournament.
Performance measure: the percent of games it wins in this world tournament.
1.2.1 Choosing the Training Experience
The first design choice is to choose the type of training experience from which our system will learn. The type of training experience available can have a significant impact on the success or failure of the learner (compare taking a driving class to learn to drive a car).
There are three attributes which impact the success or failure of the learner:
1. Whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
2. The degree to which the learner controls the sequence of training examples.
3. How well it represents the distribution of examples over which the final system performance P must be measured.
Example: the checkers problem
1. Whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
For example, in the checkers game:
• In learning to play checkers, the system might learn from direct training examples consisting of individual checkers board states and the correct move for each.
• Indirect training examples consist of the move sequences and final outcomes of various games played.
• The information about the correctness of specific moves early in the game must be inferred indirectly from the fact that the game was eventually won or lost.
• Here the learner faces an additional problem of credit assignment, or determining the degree to which each move in the sequence deserves credit or blame for the final outcome.
• Credit assignment can be a particularly difficult problem because the game can be lost even when early moves are optimal, if these are followed later by poor moves.
• Hence learning from direct training feedback is typically easier than learning from indirect feedback.
A checkers learning problem:
Task T: playing checkers
Performance measure P: percent of games won in the world tournament
Training experience E: games played against itself
In order to complete the design of the learning system, we must now choose
1. the exact type of knowledge to be learned
2. a representation for this target knowledge
3. a learning mechanism
Let us therefore define the target value V(b) for an arbitrary board state b in
B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game
1.2.2 Choosing a Representation for the Target Function
Let us choose a simple representation: for any given board state, the function V̂ will be calculated as a linear combination of the following board features:
• x1: the number of black pieces on the board
• x2: the number of red pieces on the board
• x3: the number of black kings on the board
• x4: the number of red kings on the board
• x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
• x6: the number of red pieces threatened by black
Thus, our learning program will represent V̂(b) as a linear function of the form
V̂(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6
where w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm. Learned values for the weights w1 through w6 will determine the relative importance of the various board features in determining the value of the board, whereas the weight w0 will provide an additive constant to the board value.
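A minimal sketch of this representation (our own illustration, assuming the six board features have already been extracted into a list):

def v_hat(weights, features):
    # Linear board evaluation: V̂(b) = w0 + w1*x1 + ... + w6*x6.
    # weights = [w0, w1, ..., w6]; features = [x1, ..., x6].
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

# Hypothetical weights and board features (x1..x6 as listed above).
w = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]
b = [12, 11, 1, 0, 2, 1]  # 12 black pieces, 11 red, 1 black king, ...
print(v_hat(w, b))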
1.2.4.1 ESTIMATING TRAINING VALUES
Assign the training value Vtrain(b) for any intermediate board state b to be V̂(Successor(b)), where V̂ is the learner's current approximation to V and where Successor(b) denotes the next board state following b.
• One such algorithm is called the least mean squares, or LMS, training rule. For each observed training example, it adjusts the weights a small amount in the direction that reduces the error on this training example.
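In Mitchell's formulation the LMS rule nudges each weight in proportion to its feature value and the current error. A sketch (the learning rate eta is a hypothetical choice), reusing the v_hat representation above:

def lms_update(weights, features, v_train, eta=0.1):
    # One LMS step: w_i <- w_i + eta * (V_train(b) - V̂(b)) * x_i.
    # x[0] is fixed at 1 so that weights[0] plays the role of w0.
    x = [1.0] + list(features)
    error = v_train - sum(w * xi for w, xi in zip(weights, x))
    return [w + eta * error * xi for w, xi in zip(weights, x)]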
1.2.4 The Final Design
o The final design of our checkers learning system can be naturally described
by four distinct program modules that represent the central components in
many learning systems.
o These four modules, summarized in Figure 1.1, are as follows:
o The Performance System is the module that must solve the given performance task, in this case playing checkers, by using the learned target function(s). It takes an instance of a new problem (a new game) as input and produces a trace of its solution (the game history) as output.
o The Critic takes as input the history or trace of the game and produces as output a set of training examples of the target function.
o The Generalizer takes as input the training examples and produces an output hypothesis that is its estimate of the target function. It generalizes from the specific training examples, hypothesizing a general function that covers these examples and other cases beyond the training examples. In our example, the Generalizer corresponds to the LMS algorithm, and the output hypothesis is the function V̂ described by the learned weights w0, . . . , w6.
o The Experiment Generator takes as input the current hypothesis (the currently learned function) and outputs a new problem (i.e., an initial board state) for the Performance System to explore. Its role is to pick new practice problems that will maximize the learning rate of the overall system. In our example, the Experiment Generator follows a very simple strategy: it always proposes the same initial game board to begin a new game.
Perspectives and Issues in Machine Learning
Perspective: machine learning involves searching a very large space of possible hypotheses to determine the one that best fits the observed data.
Issues:
o Which algorithm performs best for which types of problems and representations?
o How much training data is sufficient?
o Can prior knowledge be helpful even when it is only approximately correct?
o What is the best strategy for choosing a useful next training experience?
o What specific function should the system attempt to learn?
o How can the learner automatically alter its representation to improve its ability to represent and learn the target function?
Concept Learning
• Inducing general functions from specific training examples is a main issue of machine
learning.
• Concept Learning: Acquiring the definition of a general category from given sample
positive and negative training examples of the category.
• Concept learning can be seen as a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples.
• The hypothesis space has a general-to-specific ordering of hypotheses, and the search
can be efficiently organized by taking advantage of a naturally occurring structure over the
hypothesis space.
Inferring a Boolean-valued function from training examples of its input and output.
• An example for concept-learning is the learning of bird-concept from the given
examples of birds (positive examples) and non-birds (negative examples).
• We are trying to learn the definition of a concept from given examples.
A Concept Learning Task – EnjoySport Training Examples
(Table of training examples: the instance attributes and the target concept EnjoySport.)
Hypothesis Representation
• A hypothesis:
Sky AirTemp Humidity Wind Water Forecast
< Sunny, ? , ? , Strong , ? , Same >
• The most general hypothesis – that every day is a positive example
<?, ?, ?, ?, ?, ?>
• The most specific hypothesis – that no day is a positive example
<0, 0, 0, 0, 0, 0>
• EnjoySport concept learning task requires learning the sets of days for which
EnjoySport=yes, describing this set by a conjunction of constraints over the instance
attributes.
EnjoySport Concept Learning Task
Given
Instances X : set of all possible days, each described by the attributes
• Sky – (values: Sunny, Cloudy, Rainy)
• Air-Temp – (values: Warm, Cold)
• Humidity – (values: Normal, High)
• Wind – (values: Strong, Weak)
• Water – (values: Warm, Cold)
• Forecast – (values: Same, Change)
Target Concept (Function) c : EnjoySport : X → {0,1}
Hypotheses H : Each hypothesis is described by a conjunction of constraints on the attributes.
Training Examples D : positive and negative examples of the target function
Determine
A hypothesis h in H such that h(x) = c(x) for all x in D.
The Inductive Learning Hypothesis
•Although the learning task is to determine a hypothesis h identical to the target concept c over the entire set of instances X, the only information available about c is its value over the training examples.
Inductive learning algorithms can at best guarantee that the output hypothesis fits the target concept over
the training data.
Lacking any further information, our assumption is that the best hypothesis regarding unseen instances
is the hypothesis that best fits the observed training data. This is the fundamental assumption of inductive
learning.
•The Inductive Learning Hypothesis - Any hypothesis found to approximate the target function well
over a sufficiently large set of training examples will also approximate the target function well over
other unobserved examples.
•Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly
defined by the hypothesis representation.
•The goal of this search is to find the hypothesis that best fits the training examples.
•By selecting a hypothesis representation, the designer of the learning algorithm implicitly defines the
space of all hypotheses that the program can ever represent and therefore can ever learn.
•Sky has 3 possible values, and the other 5 attributes have 2 possible values each.
•There are 96 (= 3·2·2·2·2·2) distinct instances in X.
•There are 5120 (= 5·4·4·4·4·4) syntactically distinct hypotheses in H.
Two more values for attributes: ? and 0
•Every hypothesis containing one or more 0 symbols represents the empty set of instances; that is, it
classifies every instance as negative.
•There are 973 (= 1 + 4·3·3·3·3·3) semantically distinct hypotheses in H.
Only one more value for attributes: ?, and one hypothesis representing empty set of instances.
•Although EnjoySport has a small, finite hypothesis space, most learning tasks have much larger (even infinite) hypothesis spaces.
We need efficient search algorithms on the hypothesis spaces.
General-to-Specific Ordering of Hypotheses
Many algorithms for concept learning organize the search through the hypothesis space by relying on a
general-to-specific ordering of hypotheses.
By taking advantage of this naturally occurring structure over the hypothesis space, we can design
learning algorithms that exhaustively search even infinite hypothesis spaces without explicitly
enumerating every hypothesis. Consider two hypotheses
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
Now consider the sets of instances that are classified positive by h1 and by h2. Because h2 imposes fewer constraints on the instance, it classifies more instances as positive. In fact, any instance classified positive by h1 will also be classified positive by h2. Therefore, we say that h2 is more general than h1.
More-General-Than Relation
For any instance x in X and hypothesis h in H, we say that x satisfies h if and only if h(x) = 1.
More-General-Than-Or-Equal Relation: Let h1 and h2 be two Boolean-valued functions defined over
X. Then h1 is more-general-than-or-equal-to h2 (written h1 ≥ h2) if and only if any instance that satisfies
h2 also satisfies h1. h1 is more-general-than h2 ( h1 > h2) if and only if h1≥h2 is true and h2≥h1 is false.
We also say h2 is more-specific-than h1.
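For conjunctive hypotheses like those above, the more-general-than-or-equal test can be written directly; a small sketch using '?' for "any value" and '0' for "no value":

def more_general_or_equal(h1, h2):
    # True if every instance satisfying h2 also satisfies h1 (h1 >= h2).
    if '0' in h2:
        return True  # h2 matches no instance, so the claim holds vacuously
    return all(a1 == '?' or a1 == a2 for a1, a2 in zip(h1, h2))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
print(more_general_or_equal(h2, h1))  # True: h2 is more general than h1
print(more_general_or_equal(h1, h2))  # False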
FIND-S: Finding a Maximally Specific Hypothesis
FIND-S's output hypothesis will also be consistent with the negative examples, provided the correct target concept is contained in H and the training examples are correct.
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x:
   For each attribute constraint ai in h:
      If the constraint ai is satisfied by x, then do nothing
      Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
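A compact sketch of FIND-S for this conjunctive representation, run on the standard EnjoySport training data:

def find_s(examples):
    # Start with the most specific hypothesis and minimally generalize
    # it on each positive training example; negatives are ignored.
    h = ['0'] * len(examples[0][0])
    for x, label in examples:
        if label != 'yes':
            continue
        for i, value in enumerate(x):
            if h[i] == '0':
                h[i] = value   # first positive example: copy its values
            elif h[i] != value:
                h[i] = '?'     # conflicting values: generalize to '?'
    return h

# (Sky, AirTemp, Humidity, Wind, Water, Forecast) -> EnjoySport
D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'yes'),
     (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), 'yes'),
     (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), 'no'),
     (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), 'yes')]
print(find_s(D))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']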
Although FIND-S will find a hypothesis consistent with the training data, it has no way to determine
whether it has found the only hypothesis in H consistent with the data (i.e., the correct target
concept), or whether there are many other consistent hypotheses as well.
We would prefer a learning algorithm that could determine whether it had converged and, if not, at
least characterize its uncertainty regarding the true identity of the target concept.
In case there are multiple hypotheses consistent with the training examples, FIND-S will find the
most specific.
It is unclear whether we should prefer this hypothesis over, say, the most general, or some other
hypothesis of intermediate generality.
In most practical learning problems there is some chance that the training examples will contain at
least some errors or noise.
Such inconsistent sets of training examples can severely mislead FIND-S, given the fact that it ignores
negative examples.
We would prefer an algorithm that could at least detect when the training data is inconsistent and,
preferably, accommodate such errors.
In the hypothesis language H for the Enjoy-Sport task, there is always a unique, most specific
hypothesis consistent with any set of positive examples.
However, for other hypothesis spaces there can be several maximally specific hypotheses consistent
with the data.
In this case, FIND-S must be extended to allow it to backtrack on its choices of how to generalize the
hypothesis, to accommodate the possibility that the target concept lies along a different branch of
the partial ordering than the branch it has selected.
Candidate-Elimination Algorithm
•FIND-S outputs a hypothesis from H that is consistent with the training examples, but this is just one of many hypotheses from H that might fit the training data equally well.
•The key idea in the Candidate-Elimination algorithm is to output a description of the set of all
hypotheses consistent with the training examples.
Candidate-Elimination algorithm computes the description of this set without explicitly enumerating
all of its members.
This is accomplished by using the more-general-than partial ordering and maintaining a compact
representation of the set of consistent hypotheses.
Consistent hypothesis:
•An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is a positive or
negative example of the target concept.
•However, whether such an example is consistent with h depends on the target concept, and in
particular, whether h(x) = c(x).
Version Spaces:
• This subset of all hypotheses is called the version space with respect to the hypothesis space
H and the training examples D, because it contains all plausible versions of the target concept.
List-Then-Eliminate Algorithm
•List-Then-Eliminate algorithm initializes the version space to contain all hypotheses in H, then
eliminates any hypothesis found inconsistent with any training example.
•The version space of candidate hypotheses thus shrinks as more examples are observed, until
ideally just one hypothesis remains that is consistent with all the observed examples.
Presumably, this is the desired target concept.
If insufficient data is available to narrow the version space to a single hypothesis, then the algorithm
can output the entire set of hypotheses consistent with the observed data.
It has many advantages, including the fact that it is guaranteed to output all hypotheses consistent
with the training data.
•A version space can be represented with its general and specific boundary sets.
•The Candidate-Elimination algorithm represents the version space by storing only its most general
members G and its most specific members S.
•Given only these two sets S and G, it is possible to enumerate all members of a version space by
generating hypotheses that lie between these two sets in general-to-specific partial ordering over
hypotheses.
• A version space with its general and specific boundary sets.
• The version space includes all six hypotheses shown here, but can be represented more
simply by S and G.
Candidate-Elimination Algorithm
• The Candidate-Elimination algorithm computes the version space containing all hypotheses
from H that are consistent with an observed sequence of training examples.
• It begins by initializing the version space to the set of all hypotheses in H; that is, by
initializing the G boundary set to contain the most general hypothesis in H
G0 ← { <?, ?, ?, ?, ?, ?> }
and initializing the S boundary set to contain the most specific hypothesis
S0 ← { <0, 0, 0, 0, 0, 0> }
• These two boundary sets delimit the entire hypothesis space, because every other
hypothesis in H is both more general than S0 and more specific than G0.
• As each training example is considered, the S and G boundary sets are generalized and
specialized, respectively, to eliminate from the version space any hypotheses found inconsistent
with the new training example.
• After all examples have been processed, the computed version space contains all the
hypotheses consistent with these examples and only these hypotheses.
For each training example d, do:
– If d is a positive example:
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d:
– Remove s from S
– Add to S all minimal generalizations h of s such that h is consistent with d, and some member of G is more general than h
– Remove from S any hypothesis that is more general than another hypothesis in S
– If d is a negative example:
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d:
– Remove g from G
– Add to G all minimal specializations h of g such that h is consistent with d, and some member of S is more specific than h
– Remove from G any hypothesis that is less general than another hypothesis in G
•Given that there are six attributes that could be specified to specialize G2, why are there only three
new hypotheses in G3?
•For example, the hypothesis h = <?, ?, Normal, ?, ?, ?> is a minimal specialization of G2 that correctly
labels the new example as a negative example, but it is not included in G3. The reason this hypothesis
is excluded is that it is inconsistent with S2. The algorithm determines this simply by noting that h is
not more general than the current specific boundary, S2. In fact, the S boundary of the version space
forms a summary of the previously encountered positive examples that can be used to determine
whether any given hypothesis is consistent with these examples. The G boundary summarizes the
information from previously encountered negative examples. Any hypothesis more specific than G is
assured to be consistent with past negative examples
•The fourth training example further generalizes the S boundary of the version space. It also results
in removing one member of the G boundary, because this member fails to cover the new positive
example. To understand the rationale for this step, it is useful to consider why the offending
hypothesis must be removed from G. Notice it cannot be specialized, because specializing it would not
make it cover the new example. It also cannot be generalized, because by the definition of G, any
more general hypothesis will cover at least one negative training example. Therefore, the hypothesis
must be dropped from the G boundary, thereby removing an entire branch of the partial ordering
from the version space of hypotheses remaining under consideration
• After processing these four examples, the boundary sets S4 and G4 delimit the version space
of all hypotheses consistent with the set of incrementally observed training examples.
• This learned version space is independent of the sequence in which the training examples are
presented (because in the end it contains all hypotheses consistent with the set of examples).
• As further training data is encountered, the S and G boundaries will move monotonically closer
to each other, delimiting a smaller and smaller version space of candidate hypotheses.
•The version space learned by the Candidate-Elimination algorithm will converge toward the hypothesis that correctly describes the target concept, provided (1) there are no errors in the training examples, and (2) there is some hypothesis in H that correctly describes the target concept.
If the training examples contain errors, the algorithm will remove the correct target concept from the version space: the S and G boundary sets will eventually converge to an empty version space if sufficient additional training data is available. Such an empty version space indicates that there is no hypothesis in H consistent with all observed training examples. A similar symptom will appear when the training examples are correct, but the target concept cannot be described in the hypothesis representation (e.g., if the target concept is a disjunction of feature attributes and the hypothesis space supports only conjunctive descriptions).
•We have assumed that training examples are provided to the learner by some external teacher.
•Suppose instead that the learner is allowed to conduct experiments in which it chooses the next
instance, then obtains the correct classification for this instance from an external oracle (e.g., nature
or a teacher).
This scenario covers situations in which the learner may conduct experiments in nature or in which a
teacher is available to provide the correct classification. We use the term query to refer to such
instances constructed by the learner, which are then classified by an external oracle. Consider the version space learned from the four training examples of the EnjoySport concept.
What would be a good query for the learner to pose at this point?
The learner should attempt to discriminate among the alternative competing hypotheses in its
current version space.
Therefore, it should choose an instance that would be classified positive by some of these hypotheses,
but negative by others.
This instance satisfies three of the six hypotheses in the current version space.
If the trainer classifies this instance as a positive example, the S boundary of the version space can
then be generalized.
Alternatively, if the trainer indicates that this is a negative example, the G boundary can then be
specialized.
In general, the optimal query strategy for a concept learner is to generate instances that satisfy exactly
half the hypotheses in the current version space.
When this is possible, the size of the version space is reduced by half with each new example, and the
correct target concept can therefore be found with only log2 |VS| experiments.
•Even though the learned version space still contains multiple hypotheses, indicating that the target
concept has not yet been fully learned, it is possible to classify certain examples with the same degree
of confidence as if the target concept had been uniquely identified.
Instance A is classified as a positive instance by every hypothesis in the current version space. Because the hypotheses in the version space unanimously agree that this is a positive instance, the learner can classify instance A as positive with the same confidence it would have if it had already converged to the single, correct target concept. Regardless of which hypothesis in the version space is eventually found to be the correct target concept, it is already clear that it will classify instance A as a positive example. Notice furthermore that we need not enumerate every hypothesis in the version space in order to test whether each classifies the instance as positive.
This condition will be met if and only if the instance satisfies every member of S. The reason is that
every other hypothesis in the version space is at least as general as some member of S. By our
definition of more-general-than, if the new instance satisfies all members of S it must also satisfy each
of these more general hypotheses.
Instance B is classified as a negative instance by every hypothesis in the version space. This instance
can therefore be safely classified as negative, given the partially learned concept. An efficient test for
this condition is that the instance satisfies none of the members of G. Half of the version space
hypotheses classify instance C as positive and half classify it as negative. Thus, the learner cannot
classify this example with confidence until further training examples are available. Instance D is
classified as positive by two of the version space hypotheses and negative by the other four
hypotheses. In this case we have less confidence in the classification than in the unambiguous cases
of instances A and B.
Still, the vote is in favour of a negative classification, and one approach we could take would be to
output the majority vote, perhaps with a confidence rating indicating how close the vote was.
•The Candidate-Elimination Algorithm will converge toward the true target concept provided it is
given accurate training examples and provided its initial hypothesis space contains the target concept.
•Can we avoid this difficulty by using a hypothesis space that includes every possible hypothesis?
•How does the size of this hypothesis space influence the ability of the algorithm to generalize to
unobserved instances?
•How does the size of the hypothesis space influence the number of training examples that must be
observed?
In the EnjoySport example, we restricted the hypothesis space to include only conjunctions of attribute values. Because of this restriction, the hypothesis space is unable to represent even simple disjunctive target concepts such as "Sky = Sunny or Sky = Cloudy."
• From the first two examples, S2 : <?, Warm, Normal, Strong, Cool, Change>
• This is inconsistent with the third example, and there are no hypotheses consistent with these three examples
The obvious solution to the problem of assuring that the target concept is in the hypothesis space H
is to provide a hypothesis space capable of representing every teachable concept.
Our conjunctive hypothesis space is able to represent only 973 of these hypotheses: a very biased hypothesis space.
•Let the hypothesis space H be the power set of X. A hypothesis can then be represented with disjunctions, conjunctions, and negations of our earlier hypotheses. The target concept "Sky = Sunny or Sky = Cloudy" could then be described as <Sunny, ?, ?, ?, ?, ?> ∨ <Cloudy, ?, ?, ?, ?, ?>.
NEW PROBLEM: our concept learning algorithm is now completely unable to generalize beyond the observed examples. Suppose we present three positive examples (x1, x2, x3) and two negative examples (x4, x5) to the learner: S then becomes just the disjunction of the positive examples and G the negation of the disjunction of the negative examples. Therefore, the only examples that will be unambiguously classified by S and G are the observed training examples themselves.
Inductive Bias –
•A learner that makes no a priori assumptions regarding the identity of the target concept has no
rational basis for classifying any unseen instances.
•Inductive Leap: A learner should be able to generalize training data using prior assumptions in order
to classify unseen instances.
•The generalization is known as inductive leap and our prior assumptions are the inductive bias of the
learner.
•Inductive Bias (prior assumptions) of Candidate-Elimination Algorithm is that the target concept can
be represented by a conjunction of attribute values, the target concept is contained in the hypothesis
space and training examples are correct.
Inductive Bias:
The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc, the following formula holds:
(∀ xi ∈ X) [ (B ∧ Dc ∧ xi) ⊢ L(xi, Dc) ]
where L(xi, Dc) is the classification that L assigns to xi after training on Dc.
ROTE-LEARNER: Learning corresponds simply to storing each observed training example in memory.
Subsequent instances are classified by looking them up in memory. If the instance is found in memory,
the stored classification is returned. Otherwise, the system refuses to classify the new instance.
CANDIDATE-ELIMINATION: New instances are classified only in the case where all members of the
current version space agree on the classification. Otherwise, the system refuses to classify the new
instance.
Inductive Bias: the target concept can be represented in its hypothesis space.
FIND-S: This algorithm, described earlier, finds the most specific hypothesis consistent with the
training examples. It then uses this hypothesis to classify all subsequent instances.
Inductive Bias: the target concept can be represented in its hypothesis space, and all instances are negative instances unless the opposite is entailed by its other knowledge.
•Concept learning can be seen as a problem of searching through a large predefined space of potential
hypotheses.
•The general-to-specific partial ordering of hypotheses provides a useful structure for organizing the
search through the hypothesis space.
•The CANDIDATE-ELIMINATION algorithm utilizes this general-to- specific ordering to compute the
version space (the set of all hypotheses consistent with the training data) by incrementally computing
the sets of maximally specific (S) and maximally general (G) hypotheses.
• Because the S and G sets delimit the entire set of hypotheses consistent with the data, they provide the learner with a description of its uncertainty regarding the exact identity of the target concept. This version space of alternative hypotheses can be examined to determine which unseen instances can be unambiguously classified based on the partially learned concept.
The CANDIDATE-ELIMINATION algorithm is not robust to noisy data or to situations in which the
unknown target concept is not expressible in the provided hypothesis space.
Inductive learning algorithms are able to classify unseen examples only because of their implicit
inductive bias for selecting one consistent hypothesis over another.
If the hypothesis space is enriched to the point where there is a hypothesis corresponding to every possible subset of instances (the power set of the instances), this will remove any inductive bias from the CANDIDATE-ELIMINATION algorithm.
Unfortunately, this also removes the ability to classify any instance beyond the observed training
examples. An unbiased learner cannot make inductive leaps to classify unseen examples.
Linear Discriminant Analysis
Linear Regression Analysis
• Linear regression attempts to show the relationship between two variables with a linear equation.
• The simplest form of a simple linear regression equation, with one dependent and one independent variable, is y = a + b·x, where y is the dependent variable, x is the independent variable, b is the slope and a is the intercept.
Typical examples of such variable pairs:
• Product price – Sales
• Height – Weight
• Processor – RAM
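A minimal ordinary-least-squares sketch of such a fit, with illustrative height/weight numbers of our own:

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x:
    # b = cov(x, y) / var(x), a = mean(y) - b * mean(x).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

heights = [150, 160, 170, 180, 190]   # hypothetical heights (cm)
weights = [50, 57, 64, 71, 78]        # hypothetical weights (kg)
a, b = fit_line(heights, weights)
print(a, b)  # weight = a + b * height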
Review Questions
1. Why was Machine Learning introduced?
2. What are the different types of Machine Learning algorithms?
3. What is Supervised Learning?
4. What is Unsupervised Learning?
5. How is neuroscience related to machine learning?
6. How does machine learning work similarly to a brain?
7. What is the main difference between a human brain and a computer?
8. What is a concept learning task in machine learning?
9. What is a hypothesis in concept learning?
10. What are the objectives of machine learning?
11. What is the goal of the concept learning search task?
12. Solved numerical example (Candidate Elimination Algorithm): attributes Citations, Size, InLibrary, Price, Editions; target concept Buy.
UNIT – II LINEAR MODELS
Multi-Layer Perceptron in Practice
To solve real problems, MLPs are used for four different types of problem: regression, classification, time-series prediction, and data compression.
• The +1s are bias nodes with adjustable weights. An MLP has a huge number of adjustable parameters that we need to set during the training phase.
• Setting the values of the adjustable weights is the job of the back-propagation algorithm, which is driven by the errors coming from the training data.
• More training data is better for learning, although the time the algorithm takes to learn increases.
• The minimum amount of data required depends on the problem.
• A rule of thumb is to use a number of training examples that is about 10 times the number of weights.
• Since this can be a large number of examples, neural network training is a computationally expensive operation.
• Two hidden layers are the most that are ever needed for normal MLP learning. This result can be strengthened by showing mathematically that one hidden layer with lots of hidden nodes is sufficient. This is known as the Universal Approximation Theorem.
• In practice, we choose the number of hidden nodes by training networks with different numbers of hidden nodes and then choosing the one that gives the best results.
• When using the back-propagation algorithm for a network with many layers, it is harder to keep track of which weights are being updated.
• The basic idea is that by combining sigmoid functions we can generate ridge-like functions, and by combining ridge-like functions, generate functions with a unique maximum.
• By combining these and transforming them using another layer of neurons, we obtain a localized response (a "bump" function), and any functional mapping can be approximated to arbitrary accuracy using a linear combination of such bumps.
• Two hidden layers are sufficient to compute these bump functions for different inputs, and so if the function to be learnt (approximated) is continuous, the network can compute it.
• Training the MLP requires that the algorithm runs over the entire data set many times, with the weights changing as the network makes errors in each iteration.
• If we run a predefined number N of iterations, the network may have overfitted (overfitting, or overtraining: when the NN learns too many I/O examples it may end up memorizing the training data) or may not have learnt sufficiently; if instead we stop only when some predefined minimum sum-of-squares error is reached during training, the algorithm may never terminate.
• At some stage the error on the validation set will start increasing again, because the network has stopped learning about the function that generated the data, and started to learn about the noise. At this stage we stop the training. This technique is called early stopping (sketched below).
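A sketch of that early-stopping loop, assuming hypothetical helpers train_one_epoch and validation_error:

def train_with_early_stopping(net, train, valid, patience=3, max_epochs=1000):
    # Stop once the validation error stops improving for `patience` epochs.
    best_err, best_net, bad_epochs = float('inf'), net, 0
    for epoch in range(max_epochs):
        net = train_one_epoch(net, train)    # assumed helper: one pass over the training set
        err = validation_error(net, valid)   # assumed helper: error on held-out data
        if err < best_err:
            best_err, best_net, bad_epochs = err, net, 0
        else:
            bad_epochs += 1                  # validation error went up
            if bad_epochs >= patience:
                break                        # the network has started to overfit
    return best_net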
A Regression Problem: given input values, train the network to reproduce the target function, here a sine wave.
Train an MLP on the data. There is one input value, x, and one output value t, so the neural network will have one input and one output. Before getting started, we need to normalise the data, and then separate it into training, testing, and validation sets. In the given example there are only 40 datapoints, and we use half of them as the training set. Split the data in the ratio 50:25:25 by using the odd-numbered elements as training data, the even-numbered ones that do not divide by 4 for testing, and the rest for validation (see the sketch below). Construct a network with three nodes in the hidden layer, and run it for 101 iterations with a learning rate of 0.25. The output is:
Iteration: 0 Error: 12.3704163654
Iteration: 100 Error: 8.2075961385
so the network is learning, since the error is decreasing.
We now need to do two things: work out how many hidden nodes we need, and decide how long to train the network for. To solve the first problem we need to test out different networks and see which get lower errors, but to do that properly we need to know when to stop training. So we solve the second problem first, which is to implement early stopping: keep track of the validation error and stop when it starts to increase.
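The 50:25:25 split described above is a few lines of array slicing; a minimal sketch, with our own noisy sine-wave data standing in for the book's:

import numpy as np

x = np.linspace(0, 1, 40).reshape(-1, 1)                        # 40 input points
t = np.sin(2 * np.pi * x) + np.random.normal(0, 0.1, x.shape)   # noisy sine targets

train_x, train_t = x[0::2], t[0::2]   # odd-numbered elements (1st, 3rd, ...)
test_x, test_t = x[1::4], t[1::4]     # even-numbered ones not divisible by 4
valid_x, valid_t = x[3::4], t[3::4]   # the rest
print(len(train_x), len(test_x), len(valid_x))  # 20 10 10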
• The validation error did not improve after that, and so early stopping found the correct point.
o This leaves the problem of finding the right size of network.
• A first approach to classification is to use a single linear node for the output, y, and put some thresholds on the activation value of that node, for example for a four-class problem.
• But what if y is close to a boundary, say y = 0.5? It is assigned to class C3, yet it lies close to the boundary in the output.
• A more suitable output encoding is called 1-of-N encoding: e.g., (0, 0, 1, 0) means that the correct result is the 3rd class out of 4. We pick the element yk of the output vector that is the largest element of y (in mathematical notation, pick the yk for which yk > yj ∀ j ≠ k, where ∀ means "for all"); see the sketch after this list.
➢ This is known as the hard-max activation function (since the neuron with the highest activation is chosen to fire and the rest are ignored).
➢ Consider two-class classification where 90% of our data belongs to class 1. (This can happen: for example, in medical data, most tests are negative in general.)
➢ There is an alternative solution, known as novelty detection, which is to train on the data in the negative class only, and to assume that anything that looks different to that is a positive example.
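1-of-N encoding and the hard-max choice are each a line of NumPy; a small sketch:

import numpy as np

def one_of_n(label, n):
    # 1-of-N target encoding, e.g. one_of_n(2, 4) -> [0, 0, 1, 0].
    target = np.zeros(n)
    target[label] = 1
    return target

def hard_max(y):
    # Pick the class whose output activation is largest (yk > yj for j != k).
    return int(np.argmax(y))

print(one_of_n(2, 4))                   # [0. 0. 1. 0.]: 3rd class of 4
print(hard_max([0.1, 0.2, 0.9, 0.3]))   # 2 (the third class, zero-indexed)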
2.3.3 A Classification Example: The Iris Dataset
• The dataset describes three types of iris (flower) by the length and width of their sepals and petals, and is called iris.
• stext1 = ’Iris-setosa’, stext2 = ’Iris-versicolor’, stext3 = ’Iris-virginica’.
• We need to separate the data into training, testing, and validation sets.
• There are 150 examples in the dataset, and they are split evenly amongst the three classes, so the three classes are the same size.
• Split them into ½ training and ¼ each testing and validation: 50 are class 1, 50 class 2, etc.
2.3.4 Time-Series Prediction
The perceptron is the simplest form of a neural network, used for the classification of patterns said to be linearly separable. Basically, it consists of a single neuron with adjustable synaptic weights and a bias.
If there is a solution to be found, then the single-layer perceptron learning algorithm will find it.
• It can easily separate classes that lie on either side of a straight line.
• But in reality, divisions between classes are much more complex.
• Take for example the classical exclusive-or (XOR) problem.
• The XOR logic function has two inputs and one output.
• The perceptron supports only a limited set of functions: its decision boundaries must be hyperplanes, so it can only perfectly separate linearly separable data.
• We consider this as a problem which we want the perceptron to learn to solve:
• Output 1 if x1 is on and x2 is off, or if x2 is on and x1 is off; otherwise output 0.
• This appears to be a simple problem, but there is no linear solution: the problem is linearly inseparable.
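The inseparability can be checked concretely: no weights (w1, w2) and threshold θ put the two "on" cases on one side of a straight line and the other two cases on the other. The brute-force search below (our own illustration) finds no separating line on a whole grid of candidates:

from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def separates(w1, w2, theta):
    # Does the line w1*x1 + w2*x2 = theta classify all four XOR cases?
    return all((w1 * x1 + w2 * x2 > theta) == bool(y)
               for (x1, x2), y in XOR.items())

grid = [i / 10 for i in range(-20, 21)]   # candidate weights and thresholds
print(any(separates(w1, w2, th)
          for w1, w2, th in product(grid, grid, grid)))  # False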
Multi-Layer Perceptron (MLP)
✓ These are generally known as going forwards and backwards through the network
Properties of architecture
• Number of hidden units per layer can be more or less than input or output units
The 1st layer draws linear boundaries, the 2nd layer combines the boundaries, and the 3rd layer can generate arbitrarily complex boundaries.
2.1 Going Forward
• Start at the left by filling in the values for the inputs.
• Use these inputs and the first level of weights to calculate the activations of the hidden layer.
• Use those activations and the next set of weights to calculate the activations of the output layer.
• Once we have the outputs of the network, we can compare them to the targets and compute the error.
Biases
• We include a bias input to each neuron, as in the Perceptron, by having an extra input that is permanently set to -1 and adjusting its weight to each neuron as part of the training.
• Thus, each neuron in the network (whether in a hidden layer or the output) has one extra input, with fixed value.
2.2 GOING BACKWARDS: BACK-PROPAGATION OF ERROR
• Computing the errors at the output is no more difficult than it was for the Perceptron, but working out what to do with those errors is more difficult.
• The method is called back-propagation of error: the errors are sent backwards through the network.
• The best way to describe back-propagation properly is mathematically, by choosing an error function Ek = yk − tk for each output neuron k and trying to make it as small as possible.
• When there was only one set of weights in the network, this was sufficient to train the network.
• But with the addition of extra layers of weights, this is harder to arrange.
• The problem is that when we try to adapt the weights of the Multi-layer Perceptron, we have to work out which weights caused the error. These could be the weights connecting the inputs to the hidden layer, or the weights connecting the hidden layer to the output layer.
• The error function that we used for the Perceptron was E = Σk (yk − tk), summed over the N output nodes.
An MLP can make two kinds of error:
1. The target is bigger than the output.
2. The output is bigger than the target.
If these two errors are the same size, then adding them up gives 0, which suggests there was no error. To avoid this cancellation, we make all errors have the same sign.
• This can be done in a few different ways, but the one that will turn out to be best is the sum-of-squares error function, which calculates the difference between y and t for each node, squares them, and adds them all together:
E(y, t) = ½ Σk (yk − tk)²
The ½ makes it easier when we differentiate the function.
Differentiating a function gives its gradient: the direction along which it increases and decreases the most.
If we differentiate the error function, we get the gradient of the error. Since the purpose of learning is to minimise the error, we follow the error function downhill (in other words, in the direction of the negative gradient).
Imagine a ball rolling around on a surface that looks like the line in Figure 4.3. Gravity will make the ball roll downhill (follow the downhill gradient) until it ends up at the bottom of one of the hollows, the places where the error is small. This algorithm is called gradient descent.
We differentiate with respect to the three things in the network that change:
• The inputs
• The activation function that decides whether or not the node fires
• The weights
• The tanh activation function saturates (reaches its constant values) at ±1 instead of 0 and 1.
• It also has a relatively simple derivative: d/dx tanh(x) = 1 − tanh²(x).
• We can convert between the two easily: if the saturation points are ±1, then we can convert to (0, 1) by using 0.5 × (x + 1).
We now have a new form of error computation and a new activation function that decides whether or not a neuron should fire. Changing the weights is how we improve the error function of the network. We feed the inputs forward through the network and work out which nodes fire. At the output, we compute the errors as the sum-of-squares difference between the outputs and the targets. We then compute the gradient of these errors and use it to decide how much to update each weight in the network. The weights connected to the output layer are updated first; after that we work backwards through the network until we get back to the inputs again.
This raises two problems:
• For the output neurons, we don't know which inputs were responsible for the error.
• For the hidden neurons, we don't know the targets.
The Multi-layer Perceptron Algorithm
Assume:
• N output nodes
• i, j, k to index the nodes in each layer in the sums, and the corresponding Greek letters (ι, ζ, κ) for fixed indices
1. An input vector is put into the input nodes.
2. The inputs are fed forward through the network (Figure 4.6): the inputs and the first-layer weights (here labelled v) are used to decide whether the hidden nodes fire or not. The activation function g(·) is the sigmoid function given in Equation (4.2) above. The outputs of these neurons and the second-layer weights (labelled w) are used to decide if the output neurons fire or not.
3. The error is computed as the sum-of-squares difference between the network outputs and the targets.
Initialising method:
Weights are initialised to small random numbers, both positive and negative.
If the initial weight values are close to 1 or -1, then the inputs to the sigmoid are also likely to be close to ±1, and so the output of the neuron is either 0 or 1 (the sigmoid has saturated, i.e. reached its maximum or minimum value).
If the weights are very small (close to zero), then the input to the neuron is still close to 0, so the output of the neuron is just linear and we get a linear model.
The typical input to a neuron will be of size w√n, where w is the initialisation value of the weights and n is the number of nodes in the input layer.
We therefore set the weights in the range −1/√n < w < 1/√n.
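That initialisation range is one line of NumPy; a sketch for a layer with n inputs (plus the -1 bias input) feeding m neurons:

import numpy as np

def init_weights(n, m):
    # Uniform weights in (-1/sqrt(n), 1/sqrt(n)); the extra row holds
    # the weights for the bias input that is fixed at -1.
    return np.random.uniform(-1 / np.sqrt(n), 1 / np.sqrt(n), (n + 1, m))

v = init_weights(3, 4)   # first-layer weights: 3 inputs (+bias) -> 4 hidden
print(v.shape)           # (4, 4)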
3 Different Output Activation Functions
• Sigmoid neurons in the hidden and output layers give outputs between 0 and 1; a regression problem needs outputs over a continuous range, so linear output neurons are used instead.
• Soft-max activation is used with 1-of-N output encoding. The soft-max function rescales the outputs by calculating the exponential of the input to each neuron and dividing by the total sum of the exponentials of the inputs to all of the neurons, so that the activations sum to 1 and lie between 0 and 1.
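A sketch of soft-max; subtracting the maximum before exponentiating is a standard numerical-stability detail not mentioned above:

import numpy as np

def softmax(h):
    # Exponential of each input over the sum of the exponentials,
    # so the activations lie in (0, 1) and sum to 1.
    e = np.exp(h - np.max(h))   # subtracting the max avoids overflow
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))  # three values summing to 1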
2.2.4 Sequential and Batch Training
In batch training, all of the training examples are presented to the neural network, the average sum-of-squares error is computed, and this is used to update the weights.
Thus there is only one set of weight updates for each epoch (pass through all the training examples).
This means that we only update the weights once per iteration of the algorithm, so the weights are moved in the direction that most of the inputs want them to move, rather than being pulled around by each input individually.
The batch method performs a more accurate estimate of the error gradient, and will thus converge to a local minimum more quickly.
The learning rule is the minimisation of the network error by gradient descent (using the derivative of the error function to make the error smaller). We perform an optimisation, adapting the values of the weights in order to minimise the error function.
A local minimum of a function is a point where the function value is smaller than at nearby points, but possibly greater than at a distant point. A global minimum is a point where the function value is smaller than at all other feasible points.
• Momentum: neural network learning can be improved by adding in some contribution from the previous weight change to the current one.
• A benefit of momentum is that we can use a smaller learning rate, which means that the learning is more stable.
• Weight decay reduces the size of the weights as the number of iterations increases. This tends to lead to a better network, since small weights keep the neurons closer to their linear regime.
Minibatches and Stochastic Gradient Descent
The batch algorithm converges to a local minimum faster than the sequential algorithm, which computes the error for each input individually and then does a weight update, although the latter can get stuck in local minima. The idea of the minibatch method is to split the training set into random batches, estimate the gradient based on one subset of the training set, perform a weight update, and then use the next subset to estimate a new gradient and use that for the weight update, until all of the training set has been used. The training set is then randomly shuffled into new batches and the next iteration takes place. A more extreme version of the minibatch idea is to use just one piece of data to estimate the gradient at each iteration of the algorithm, and to pick that piece of data uniformly at random from the training set. So a single input vector is chosen from the training set, the output and hence the error for that one vector are computed, and this is used to estimate the gradient and update the weights. A new random input vector (which could be the same as the previous one) is then chosen and the process repeated. This is known as stochastic gradient descent. It is often used when the training set is very large, since it would be very expensive to use the whole dataset to estimate the gradient in that case.
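A minimal sketch of minibatch SGD (Python; grad_fn is an assumed helper that returns the error gradient estimated on one batch):

    import numpy as np

    def minibatch_sgd(X, y, weights, grad_fn, eta=0.1, batch_size=32, epochs=10):
        n = X.shape[0]
        for _ in range(epochs):
            order = np.random.permutation(n)          # reshuffle into new random batches
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                grad = grad_fn(X[idx], y[idx], weights)
                weights = weights - eta * grad        # one update per minibatch
        return weights

Setting batch_size=1 gives stochastic gradient descent, while batch_size=n recovers batch training.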
Other Improvements
One improvement is to reduce the learning rate as the algorithm progresses: the network makes large-scale changes to the weights at the beginning, when the weights are random, and only small refinements later. Another improvement, which can give larger performance gains, is to use the second derivatives of the error with respect to the weights. The back-propagation algorithm uses the first derivatives to drive the learning; knowledge of the second derivatives helps to improve the network further.
Overview
For this tutorial, we're going to use a neural network with two inputs, two hidden neurons, and two output neurons. Additionally, the hidden and output neurons will each include a bias.
Here's the basic structure:
The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs.
Single training set:
• Inputs 0.05 and 0.10
• To begin, let's see what the neural network currently predicts given the weights and biases above and inputs of 0.05 and 0.10.
• To do this we'll feed those inputs forward through the network.
We figure out the total net input to each hidden layer neuron, squash the total net input using
an activation function (here we use the logistic function), then repeat the process with the
output layer neurons.
We repeat this process for the output layer neurons, using the output from the hidden layer
neurons as inputs.
Calculating the Total Error
We calculate the error for each output neuron using the squared error function and sum them to get the total error:

    E_total = Σ ½ (target − output)²

Some sources refer to the target as the ideal and the output as the actual.
The ½ is included so that the exponent is cancelled when we differentiate later on. The result is eventually multiplied by a learning rate anyway, so it doesn't matter that we introduce a constant here [1].
For example, the target output for o1 is 0.01 but the neural network outputs 0.75136507, therefore its error is:

    E_o1 = ½ (0.01 − 0.75136507)² = 0.274811083

Repeating this process for o2 (remembering that the target is 0.99) we get E_o2 = 0.023560026.
The total error for the neural network is the sum of these errors:

    E_total = 0.274811083 + 0.023560026 = 0.298371109
The Backwards Pass
Our goal with backpropagation is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and the network as a whole.
Output Layer
Consider w5. We want to know how much a change in w5 affects the total error, i.e. ∂E_total/∂w5.
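By the chain rule, this question is answered by decomposing the derivative (the notation follows the walkthrough's figure):

    ∂E_total/∂w5 = ∂E_total/∂out_o1 × ∂out_o1/∂net_o1 × ∂net_o1/∂w5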
Putting it all together:
You'll often see this calculation combined in the form of the delta rule (in the walkthrough's notation):

    ∂E_total/∂w5 = −(target_o1 − out_o1) × out_o1 (1 − out_o1) × out_h1 = δ_o1 × out_h1

Some sources extract the negative sign from δ, so the rule is written with the sign pulled out front.
To decrease the error, we then subtract this value from the current weight (optionally multiplied by some learning rate, eta, which we'll set to 0.5):

    w5_new = w5 − η × ∂E_total/∂w5
Some sources use α (alpha) to represent the learning rate, others use η (eta), and others even use ε (epsilon).
We perform the actual updates in the neural network after we have the new weights leading into the hidden layer neurons (i.e., we use the original weights, not the updated weights, when we continue the backpropagation algorithm below).
Hidden Layer
Next, we'll continue the backwards pass by calculating new values for w1, w2, w3, and w4.
Visually:
We're going to use a similar process as we did for the output layer, but slightly different to account for the fact that the output of each hidden layer neuron contributes to the output (and therefore error) of multiple output neurons. We know that out_h1 affects both out_o1 and out_o2, therefore ∂E_total/∂out_h1 needs to take into consideration its effect on both output neurons:

    ∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1

Starting with ∂E_o1/∂out_h1:
And ∂net_o1/∂out_h1 is equal to w5, since net_o1 = w5·out_h1 + w6·out_h2 + b2.
Therefore:
We calculate the partial derivative of the total net input to h1 with respect to w1 in the same way as we did for the output neuron.
Finally, we've updated all of our weights! When we fed forward the 0.05 and 0.1 inputs originally, the error on the network was 0.298371109. After this first round of backpropagation, the total error is now down to 0.291027924. It might not seem like much, but after repeating this process 10,000 times, for example, the error plummets to 0.0000351085. At this point, when we feed forward 0.05 and 0.1, the two output neurons generate 0.015912196 (vs 0.01 target) and 0.984065734 (vs 0.99 target).
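The whole worked example fits in a short script. A sketch in Python follows; since the text's figure is missing, the weight and bias values are assumed from the widely circulated step-by-step tutorial this walkthrough follows:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    w = np.array([0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55])  # w1..w8 (assumed)
    b1, b2 = 0.35, 0.60                                             # biases (assumed)
    i1, i2 = 0.05, 0.10                                             # inputs
    t1, t2 = 0.01, 0.99                                             # targets
    eta = 0.5                                                       # learning rate

    def forward(w):
        h1 = sigmoid(w[0]*i1 + w[1]*i2 + b1)
        h2 = sigmoid(w[2]*i1 + w[3]*i2 + b1)
        o1 = sigmoid(w[4]*h1 + w[5]*h2 + b2)
        o2 = sigmoid(w[6]*h1 + w[7]*h2 + b2)
        return h1, h2, o1, o2

    def total_error(w):
        _, _, o1, o2 = forward(w)
        return 0.5*(t1 - o1)**2 + 0.5*(t2 - o2)**2

    def backprop_step(w):
        h1, h2, o1, o2 = forward(w)
        # delta rule at the output layer
        d_o1 = -(t1 - o1) * o1 * (1 - o1)
        d_o2 = -(t2 - o2) * o2 * (1 - o2)
        # hidden deltas take both output neurons into account
        d_h1 = (d_o1*w[4] + d_o2*w[6]) * h1 * (1 - h1)
        d_h2 = (d_o1*w[5] + d_o2*w[7]) * h2 * (1 - h2)
        grads = np.array([d_h1*i1, d_h1*i2, d_h2*i1, d_h2*i2,
                          d_o1*h1, d_o1*h2, d_o2*h1, d_o2*h2])
        return w - eta * grads    # all weights updated from the original values

    print(total_error(w))         # 0.298371109, as above
    for _ in range(10000):
        w = backprop_step(w)
    print(total_error(w))         # ~0.0000351, matching the figures quoted above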
Linear models
■ RBF Network
■ Curse of Dimensionality
An RBFN performs classification by measuring the input’s similarity to examples from the
training set. Each RBFN neuron stores a prototype, which is just one of the examples from the
training set.
When we want to classify a new input, each neuron computes the Euclidean distance between the input and its prototype; i.e., if the input more closely resembles the Class A prototypes than the Class B prototypes, it is classified as Class A.
Classification:
Training: Previous examples of each class. Output: A class out of a discrete set of classes.
Classification problems can be made to look like nonparametric regression.
Hyperplane – linearly separable
■ The input layer maps to the hidden layer, which provides a set of functions that form a basis for mapping into the hidden-layer space.
■ To perform the mapping from the input space to the hidden layer we need basis functions, which are provided by the neurons in the hidden layer. Because the hidden neurons provide radial basis functions, this kind of architecture is called a Radial Basis Function network:
◦ F(x) = Σi wi hi(x)
■ Three layers
◦ Input layer – source nodes that connect the network to its environment; the mapping from the input layer to the hidden layer is nonlinear.
What happens in the Hidden layer?
The patterns in the input space form clusters. If the centres of these clusters are known, then the distance from a cluster centre can be measured. The most commonly used radial basis function is a Gaussian function; other choices are the multiquadric and inverse multiquadric functions. In an RBF network, r is the distance from the cluster centre.
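The usual forms of these basis functions (standard textbook expressions; σ is a width parameter and r = ‖x − c‖ the distance from the cluster centre c):

    Gaussian:               h(r) = exp(−r² / 2σ²)
    Multiquadric:           h(r) = √(r² + σ²)
    Inverse multiquadric:   h(r) = 1 / √(r² + σ²)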
Curse of Dimensionality
As the number of features (dimensions) grows, the amount of data we need in order to generalise accurately grows exponentially. Dimensions are also called features or attributes; they may be independent input features or target output features. Feature selection and feature engineering are used to combat this growth.
As the model is given exponentially more features to learn, it becomes confused. Once a threshold is reached, the accuracy stops improving; if the number of features keeps increasing, say from 100 to 200 or 1,000, the accuracy actually decreases. This is the curse of dimensionality.
Support vector machines (SVMs) are powerful yet flexible supervised machine learning
algorithms which are used both for classification and regression.
But generally, they are used in classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. SVMs have their own unique way of implementation compared to other machine learning algorithms. Lately, they have become extremely popular because of their ability to handle multiple continuous and categorical variables.
Working of SVM
The goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane
(MMH).
• Support Vectors − the data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.
• Hyperplane − as seen in the diagram above, it is a decision plane or space that divides a set of objects having different classes.
• Margin − the gap between the two lines at the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.
A Support Vector Machine is a supervised learning method; it is a discriminative classifier that is formally defined by a separating hyperplane.
It is a representation of the examples as points in space, mapped so that the points of different categories are separated by as wide a gap as possible; the support vectors are the extreme points in the dataset.
The optimal hyperplane is the one at maximum distance from the support vectors of each class.
• From the distance margin we get the optimal hyperplane.
• Based on the hyperplane, we can say which class a new data point belongs to (in the example, the male gender).
• If we select a hyperplane with a low margin, then there is a high chance of misclassification.
(From the figure: the hyperplane is w·x + b = 0; points on one side satisfy w·x + b > 0 and on the other w·x + b < 0.)
Support vectors are the datapoints that the margin pushes up against; the hyperplane with the maximum margin is the one we want. This is the simplest kind of SVM, called a linear SVM (LSVM).
■ We can formulate a quadratic optimisation problem and solve for w and b. But what if the training set is noisy? Solution 1: use very powerful kernels — and beware of OVERFITTING!
(In the figures, the two point styles denote the +1 and −1 classes.)
Hard Margin vs. Soft Margin
■ The new formulation incorporating slack variables: find w and b such that

    Φ(w) = ½ wᵀw + C Σi ξi

is minimised, subject to, for all {(xi, yi)}:

    yi (wᵀxi + b) ≥ 1 − ξi  and  ξi ≥ 0 for all i
Non-linear SVMs
■ Datasets that are linearly separable with some noise work out great.
■ But what are we going to do if the dataset is just too hard?
■ General idea: the original input space can always be mapped to some higher-dimensional feature space (e.g., from x to (x, x²)) where the training set is separable.
Nonlinear SVM – Overview
■ The SVM locates a separating hyperplane in the feature space and classifies points in that space.
■ It does not need to represent the feature space explicitly; it simply defines a kernel function.
■ The kernel function plays the role of the dot product in the feature space.
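A minimal sketch of a kernel SVM on a non-linearly-separable dataset (scikit-learn; the dataset and parameter choices are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # a dataset that no straight line can separate in the input space
    X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

    # the RBF kernel acts as the dot product in an implicit feature space,
    # which is never represented explicitly
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
    print(clf.score(X, y))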
SVM Applications
Weakness of SVM
It is sensitive to noise: a relatively small number of mislabelled examples can dramatically decrease performance. It also only considers two classes — how do we do multi-class classification with an SVM?
Answer:
1) Train one SVM per class (one-versus-the-rest: each SVM learns "this class" vs "everything else").
2) To predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
Task: the classification of natural text (or hypertext) documents into a fixed number of predefined categories based on their content: email filtering, web searching, sorting documents by topic, etc. A document can be assigned to more than one category, so this can be viewed as a series of binary classification problems, one for each category.
Representation of Text
IR's vector space model (aka bag-of-words representation): a document is represented by a vector indexed by a pre-fixed set or dictionary of terms.
The similarity between two documents is φ(x)·φ(z); since K(x, z) = ⟨φ(x), φ(z)⟩ is a valid kernel, an SVM can be used with K(x, z) for discrimination.
Why SVM?
Choice of kernel
A Gaussian or polynomial kernel is the default; if these prove ineffective, more elaborate kernels are needed, and domain experts can assist in formulating appropriate similarity measures.
Choice of kernel parameters, e.g. σ in the Gaussian kernel: σ is roughly the distance between the closest points with different classifications. In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.
• Epoch — an arbitrary cut off, generally defined as “one pass over the entire dataset”, used
to separate training into distinct phases, which is useful for logging and periodic
evaluation. In layman’s term, a number of epochs means how many times you go through
your training set.
• Learning Rate — “a scalar used to train a model via gradient descent. During each
iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The
resulting product is called the gradient step. Learning rate is a key hyperparameter.”
Specifying the learning rate is equivalent to determining how fast weights change for each
iteration. In Tensorflow playground, the learning rate ranges from 0.00001 to 10.
• Activation Function — the output of that node, or “neuron,” given an input or set of
inputs. This output is then used as input for the next node and so on until a desired solution
to the original problem is found. Available activation functions in Tensorflow playground
are ReLU, Tanh, Sigmoid, and Linear.
• Regularization — a hyperparameter to prevent overfitting. Available values are L1 and
L2. L1 computes the sum of the weights, whereas L2 takes the sum of the square of
the weights.
• Regularization Rate — a scalar used to specify the rate at which the model applies the
regularization, ranging from 0 to 10.
• Problem Type — classification (categorical output) vs. regression (numerical output)
• Ratio of the Training and Testing Sets — the proportion of a subset to train a model and
a subset to test a model. I usually set it to 80/20
• Noise — a distortion in data that is construed to be extraneous to the original data.
• Batch Size — “a small, randomly selected subset of the entire batch of examples run
together in a single iteration of training or inference. The batch size of a mini-batch is
usually between 10 and 1,000.”
• Features — represents an input layer to feed in.
• Hidden Layer — a layer in between input layers and output layers, where artificial
neurons take in a set of weighted inputs and produce an output through an activation
function. In this context, you can specify as many as you want, but bear in mind that the more hidden layers you add, the more complex the model becomes.
• Output — an output layer in the neural network, often involving the loss evaluation. A loss function (or cost function) is a method of evaluating how well the neural network performs on the given data. If the predictions deviate too much from the actual results, the loss will be high. We often evaluate the losses on both the training and testing sets.
Figure 2: represents an artificial neural network (ANN) with multiple layers between the
input and output layers. For example, given input data of image pixels from MNIST dataset,
we can specify 2 hidden layers, each of which has 4 hidden neurons. Ultimately, we have the
predicted probabilities of the possible number for the given image. Image Source: Deep
learning — Convolutional neural networks and feature extraction with Python, Perone (2015)
UNIT – III TREE AND PROBABILISTIC MODELS
Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and
Regression Trees – Ensemble Learning – Boosting – Bagging – Different ways to Combine
Classifiers – Probability and Learning – Data into Probabilities – Basic Statistics – Gaussian
Mixture Models – Nearest Neighbor Methods – Unsupervised Learning – K means Algorithms
– Vector Quantization – Self Organizing Feature Map.
A tree has many analogies in real life, and it turns out that it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name suggests, it uses a tree-like model of decisions. Though it is a commonly used tool in data mining for deriving a strategy to reach a particular goal, it's also widely used in machine learning.
Classification:
Problems with categorical solutions like 'Yes' or 'No', 'True' or 'False', '1' or '0'.
Regression:
Problems wherein continuous value needs to be predicted like ‘Product Prices’, ’Profit’.
Clustering:
Problems wherein the data needs to be organized to find specific patterns like in the case of
‘Product’ Recommendation
Classification is the process of dividing the datasets into different categories or groups by
adding label. Ie., It adds the data point to a particular labelled group on the basis of some
condition.
Types of Classification
◦ Decision Tree
◦ Random Forest
◦ Naïve Bayes
◦ K Nearest Neighbour
◦ Logistic Regression
A classification tree will determine a set of logical if then conditions to classify problems. For
example, discriminating between three types of flowers based on certain features.
It is a tree shaped diagram used to determine a course of action. Each branch of the tree
represents a possible decision, occurrence or relation.
Advantages
Little effort is needed for data preparation, and less data cleaning is required.
Disadvantages
Overfitting occurs when the algorithm captures noise in the data.
High variance – the model can become unstable due to small variations in the data.
Low-biased tree – a highly complicated decision tree tends to have low bias, which makes it difficult for the model to work with new data.
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
ID3 – Iterative Dichotomiser 3 (uses the entropy function and information gain as its measures)
C4.5
We cover only the CART and ID3 algorithms here, as they are the ones most widely used.
A decision tree is a tree in which each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome.
How can an algorithm be represented as a tree?
For this let’s consider a very basic example that uses titanic data set for predicting whether a
passenger will survive or not. Below model uses 3 features/attributes/columns from the data set,
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values for the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in step 3. Continue this process until a stage is reached where the nodes cannot be classified further; call the final node a leaf node.
While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measurement, we can easily select the best attribute for the nodes of the tree.
Information Gain-ID3
Gini Index-CART
Entropy-ID3
Gain Ratio-C4.5
Reduction in Variance – regression trees
Chi-Square-CHAID
Information Gain:
According to the value of information gain, we split the node and build the decision tree. A decision tree algorithm always tries to maximise the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula:

    IG(T, X) = Entropy(T) − Entropy(T, X)

where Entropy(T) = −p(yes)·log2 p(yes) − p(no)·log2 p(no), with p(no) the probability of "no" (and similarly for "yes"). For the play-golf example:

    IG(PlayGolf, Outlook) = E(PlayGolf) − E(PlayGolf, Outlook) = 0.940 − 0.693 = 0.247
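The 0.940 and 0.693 above can be reproduced in a few lines of Python; the class counts (9 yes / 5 no overall; Outlook = sunny 2/3, overcast 4/0, rain 3/2) are assumed from the classic play-golf dataset these figures come from:

    import numpy as np

    def entropy(counts):
        p = np.array(counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]                                     # 0·log 0 is taken as 0
        return -np.sum(p * np.log2(p))

    e_parent = entropy([9, 5])                           # E(PlayGolf) = 0.940
    splits = [(5, [2, 3]), (4, [4, 0]), (5, [3, 2])]     # sunny, overcast, rain
    e_split = sum(n/14 * entropy(c) for n, c in splits)  # E(PlayGolf,Outlook) = 0.693
    print(round(e_parent - e_split, 3))                  # IG = 0.247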
Gini Index:
The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with a high Gini index.
It only creates binary splits, and the CART algorithm uses the Gini index to create them.
• Calculate the Gini score for each sub-node using the formula p² + q², where p and q are the probabilities of success and failure.
• Calculate the Gini index for the split using the weighted Gini score of each node of that split.
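Equivalently, in its usual form (standard notation; pj is the proportion of class j in a node):

    Gini impurity = 1 − Σj pj²

and the Gini index for a split is the size-weighted average of the child nodes' Gini values.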
CART (Classification and Regression Tree) uses the Gini index method to create split
points.
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning.
It is a decision tree where each fork is a split in a predictor variable and each node at the
end has a prediction for the target variable.
The CART algorithm is an important decision tree algorithm that lies at the foundation of
machine learning.
Moreover, it is also the basis for other powerful machine learning algorithms like bagged
decision trees, random forest and boosted decision trees.
The Classification and regression tree(CART) methodology is one of the oldest and most
fundamental algorithms. It is used to predict outcomes based on certain predictor variables.
Example 2: a decision tree based on numeric data — if a person is driving above 80 km/h, we consider it over-speeding, else not.
Example 3: one more simple decision tree, this time based on ranked data, where rank 1 means the speed is too high and rank 2 corresponds to a much lower speed. If a person is speeding above rank 1 then he/she is highly over-speeding. If the person is above speed rank 2 but below speed rank 1 then he/she is over-speeding, but not by much. If the person is below speed rank 2 then he/she is driving well within the speed limits.
The classification in a decision tree can be both categorical or numeric which is used in
Bioinformatics
The top node of the tree is called the root node. The nodes in between are called internal nodes; internal nodes have arrows pointing to them and arrows pointing away from them. The end nodes are called the leaf nodes or just leaves; leaf nodes have arrows pointing to them but no arrows pointing away from them.
In the above diagrams, root nodes are represented by rectangles, internal nodes by circles,
and leaf nodes by inverted-triangles.
In the example given, build a decision tree that uses chest pain, good blood circulation, and
the status of blocked arteries to predict if a person has heart disease or not.
There are two leaf nodes, one each for the two outcomes of chest pain. Each of the leaves
contains the no. of patients having heart disease and not having heart disease for the corresponding
entry of chest pain
The same thing for good blood circulation and blocked arteries.
Good blood circulation as the root node Blocked arteries as the root node
None of the 3 features separates the patients having heart disease from the patients not having heart disease perfectly. Note that the total no. of patients having heart disease is different in all three cases; this is done to simulate the missing values present in real-world datasets.
Because none of the leaf nodes is either 100% ‘yes heart disease’ or 100% ‘no heart
disease’, they are all considered impure. To decide on which separation is the best, we need
a method to measure and compare impurity.
The metric used in the CART algorithm to measure impurity is the Gini impurity score.
Calculating the Gini impurity for chest pain for the left leaf,
Similarly, calculate the Gini impurity for the right leaf node.
The leaf nodes do not represent the same number of patients: the left leaf represents 144 patients and the right leaf 159 patients. Thus the total Gini impurity is the weighted average of the leaf-node Gini impurities.
Similarly the total Gini impurity for ‘good blood circulation’ and ‘blocked arteries’ is
calculated as
'Good blood circulation' has the lowest impurity score among the three, which means it best separates the patients having and not having heart disease, so we will use it at the root node.
Steps to be repeated on the left side
• If the node itself has the lowest score, then there is no point in separating the patients
anymore and it becomes a leaf node.
• If separating the data results in improvement then pick the separation with the lowest
impurity value.
Complete Decision tree
Some examples
Classification trees vs. regression trees:
• Classification trees are used when the dataset needs to be split into classes belonging to the response variable — in many cases, the classes Yes or No. Classification trees are used for classification-type problems.
• Regression trees, on the other hand, are used when the response variable is continuous. For instance, if the response variable is something like the price of a property or the temperature of the day, a regression tree is used. Regression trees are used for prediction-type problems.
• A classification tree splits the dataset based on the homogeneity of the data; measures of impurity like entropy or the Gini index are used to quantify the homogeneity of the data.
• In a regression tree, a regression model is fit to the target variable using each of the independent variables. The error between the predicted values and the actual values is squared to get the Sum of Squared Errors (SSE). The SSE is compared across the variables, and the variable or point with the lowest SSE is chosen as the split point. This process is continued recursively.
◦ Linear Regression
◦ Logistics Regression
◦ Naive Bayes
◦ Decision Tree
◦ Voting
What is Classification
“Classification is the process of grouping things according to similar features they share”
Learn several simple models and combine their output to produce the final decision.
The combined strength of the models offsets individual model variances and biases.
This provides a composite prediction where the final accuracy is better than the accuracy
of individual models.
Significance
Ensemble learning applies multiple models to obtain better performance than any single model.
Robustness – ensemble models incorporate the predictions from all the base learners.
• Bootstrap replication:
o Given n training examples, construct a new training set by sampling n instances with replacement
o On average this excludes about a third (≈37%) of the training instances
• Bagging:
o Create bootstrap replicates of the training set
o Train a classifier (e.g., a decision tree) for each replicate
o Estimate classifier performance using the out-of-bootstrap data
o Average the outputs of all classifiers
Example: labelled images of hot dogs and other objects.
◦ Combine all weak learners to form an ensemble, or create an ensemble of well-chosen strong and diverse models.
◦ These models gain accuracy and robustness by combining data from numerous modelling approaches.
Bagging
• The idea behind bagging is to combine the results of multiple models (for instance, all decision trees) to get a generalized result.
• If we create all the models on the same set of data and combine them, they will tend to give the same result, since they receive the same input. We therefore need different subsets of the data; one technique for creating them is bootstrapping.
• The bagging (Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea of the distribution (the complete set). The size of the subsets created for bagging may be less than that of the original set.
• Multiple subsets are created from the original dataset by selecting observations with replacement.
• The final predictions are determined by combining the predictions from all the models; a sketch follows.
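A minimal sketch of bagging (Python with scikit-learn trees; the sketch assumes integer class labels):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_models=10, seed=0):
        rng = np.random.default_rng(seed)
        models, n = [], len(X)
        for _ in range(n_models):
            idx = rng.integers(0, n, size=n)   # bootstrap: sample n rows WITH replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        votes = np.array([m.predict(X) for m in models])
        # combine by majority vote across the ensemble
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)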
What is Random forest?
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on
the concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.
"Random Forest is a classifier that contains a number of decision trees on various subsets of the
given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of
relying on one decision tree, the random forest takes the prediction from each tree and based on
the majority votes of predictions, and it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem
of overfitting.
Note: To better understand the Random Forest Algorithm, you should have knowledge of the
Decision Tree Algorithm.
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a better
Random forest classifier:
There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
The predictions from each tree must have very low correlations.
Random forest is the most used supervised machine learning algorithm for classification
and regression
RF uses ensemble learning method in which the predictions are based on the combined
results of various individual models
No Overfitting
High Accuracy
Random forest can maintain accuracy when a large proportion of data is missing
The working process can be explained in the following steps:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step 1 and Step 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
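In practice the whole procedure is available off the shelf; a hedged sketch with scikit-learn (the dataset and parameters are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    # N = 100 trees, each grown on a bootstrap sample; prediction is the majority vote
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    print(rf.score(X_te, y_te))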
How Boosting Algorithm Works?
The basic principle behind the working of the boosting algorithm is to generate multiple
weak learners and combine their predictions to form one strong rule.
These weak rules are generated by applying base Machine Learning algorithms on different
distributions of the data set. These algorithms generate weak rules for each iteration.
After multiple iterations, the weak learners are combined to form a strong learner that will
predict a more accurate outcome.
The algorithm :
Step 1: The base algorithm reads the data and assigns equal weight to each sample
observation.
Step 2: False predictions made by the base learner are identified. In the next iteration, these
false predictions are assigned to the next base learner with a higher weightage on these
incorrect predictions.
Step 3: Repeat step 2 until the algorithm can correctly classify the output.
Bagging and Boosting are two of the most commonly used techniques in machine learning.
Following are the algorithms will be focusing on:
Bagging algorithms:
◦ Bagging meta-estimator
◦ Random forest
Boosting algorithms:
◦ AdaBoost(Adaptive Boosting)
◦ XGBM
◦ Light GBM
◦ CatBoost
Both make the final decision by averaging over the N learners (or by taking the majority vote among them).
There are mainly four sectors where Random forest mostly used:
Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
Land Use: We can identify the areas of similar land use by this algorithm.
Marketing: Marketing trends can be identified using this algorithm.
It enhances the accuracy of the model and prevents the overfitting issue.
Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
Standard Deviation
The standard deviation or sd of a bunch of numbers tells you how much the individual
numbers tend to differ from the mean.
The sample standard deviation is the square root of the sample variance: sd = √ s². For
example, incomes deviate from their mean by $7201.
The population standard deviation is the square root of the population variance: sd= √ 𝜎².
Consider three different data distributions with the same mean (100) and different standard deviations (5, 10, 20).
The smaller the standard deviation, the narrower the peak and the closer the data points are to the mean.
The further the data points are from the mean, the greater the standard deviation.
• The Gaussian Mixture Model, or GMM for short, is a mixture model that uses a
combination of Gaussian (Normal) probability distributions and requires the estimation of
the mean and standard deviation parameters for each.
How do we model this distribution?
3.4.1 EM algorithm
• In the EM algorithm, the estimation-step would estimate a value for the process latent
variable for each data point, and the maximization step would optimize the parameters of
the probability distributions in an attempt to best capture the density of the data.
• In the real world applications of machine learning, it is very common that there are many
relevant features available for learning but only a small subset of them are observable.
• The process is repeated until a good set of latent values and a maximum likelihood is
achieved that fits the data.
• Initially, a set of initial values of the parameters are considered. A set of incomplete
observed data is given to the system with the assumption that the observed data comes from
a specific model.
• The next step is known as the "Expectation" step, or E-step. In this step, the observed data are used to estimate or guess the values of the missing or incomplete data; it is basically used to update the variables.
• The next step is known as the "Maximization" step, or M-step. In this step, the complete data generated in the preceding E-step are used to update the values of the parameters; it is basically used to update the hypothesis.
• In the next step, it is checked whether the values are converging or not; if yes, then stop, otherwise repeat the E-step and M-step until convergence occurs.
How does it work? Start → E-step → M-step → check convergence → Stop (otherwise repeat the E- and M-steps).
Advantages
• During implementation, the E-step and M-step are very easy for many problems.
Disadvantages
• Convergence can be slow, and the algorithm may converge only to a local optimum.
Uses
• It can be used to estimate the parameters of a Hidden Markov Model (HMM).
• It can be used to discover the values of latent variables.
Simple Analogy:
• Tell me about your friends (who your neighbours are) and I will tell you who you are.
• Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
KNN – Different names
• K-Nearest Neighbors
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Lazy Learning
The k-nearest neighbours of a record x are the data points that have the k smallest distances to x.
Choosing k: increasing k reduces variance and increases bias.
NN smoothing
• Nearest neighbour methods can also be used for regression by returning the average
value of the neighbours to a point, or a spline or similar fit as the new value.
• The most common methods are known as kernel smoothers, and they use a kernel (a
weighting function between pairs of points) that decides how much emphasis(weight) to
put on to the contribution from each data point according to its distance from the input.
Both of these kernels are designed to give more weight to points that are closer to the current input,
with the weights decreasing smoothly to zero as they pass out of the range of the current input,
with the range specified by a parameter λ.
Why do we need a K-NN Algorithm?
• Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1: which of these categories will it lie in?
• To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the diagram below:
The kNN algorithm is one of the simplest supervised machine learning algorithms, mostly used for classification: it classifies a data point based on how its neighbours are classified.
• K nearest neighbour is a simple algorithm that stores all the available cases and classifies
the new data or case based on a similarity measure.
• K-NN algorithm assumes the similarity between the new case/data and available cases and
put the new case into the category that is most similar to the available categories.
• K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means when new data appears then it can be easily classified into a well
suite category by using K- NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets new data, then
it classifies that data into a category that is much similar to the new data.
Examples:
To classify a new input vector x, examine the k-closest training data points to x and assign the
object to the most frequently occurring class
kNN steps
• Handle data: open the dataset and split it into test/train datasets.
• The K-NN working can be explained on the basis of the following algorithm (a minimal sketch follows the list):
• Step-1: Select the number K of neighbours.
• Step-2: Calculate the Euclidean distance from the new point to the training points.
• Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
• Step-4: Among these K neighbours, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
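A minimal sketch of steps 2–5 in Python (the function and variable names are illustrative):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=5):
        # Steps 2-3: Euclidean distance to every stored point, keep the K nearest
        dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        # Steps 4-5: count neighbours per category, assign the majority category
        return Counter(y_train[nearest]).most_common(1)[0][0]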
Below are some points to remember while selecting the value of K in the K-NN algorithm:
• There is no particular way to determine the best value for K, so we need to try some values to find the best of them. The most preferred value for K is 5.
• A very low value of K, such as K=1 or K=2, can be noisy and subject to the effects of outliers in the model.
• Large values of K are good, but finding the best one may be difficult.
Pros:
• It is simple to implement.
Cons:
• We always need to determine the value of K, which may be complex at times.
• The computation cost is high because of calculating the distance between the new data point and all the training samples.
Real time example
Compute centroid for the cluster using all points in the cluster.
K-Means algorithm for clustering
• Given a data set of items, with certain features and values for these features ,the algorithm
will categorize the items into k groups or clusters of similarity
• To calculate the similarity, we can use the Euclidean distance, Manhattan distance,
Hamming distance, Cosine distance as measurement.
• K-means is a clustering algorithm whose goal is to group similar elements or data points
into a cluster.
• It performs a division of objects into clusters that are similar to each other and dissimilar to the objects belonging to other clusters.
• Clustering is the process of dividing the datasets into groups consisting of similar data points.
- A web search engine often returns thousands of pages in response to a broad query, making
it difficult for users to browse or to identify relevant information.
- Clustering methods can be used to automatically group the retrieved documents into a list
of meaningful categories.
• Applications of K-means clustering
– Academic performance
– Diagnostic systems
– Search engines
• Strengths
– It is also efficient: the time taken to cluster with k-means rises linearly with the number of data points.
• Weaknesses
– The user must specify k in advance, and the algorithm is sensitive to outliers and to the initial choice of centroids.
For example
Among these points 1.41,3.61 and 6.00 , assign minimum as Cluster number ie., 1.41 is minimum
and name as C1
Add centroid points (c1) and find the average. Similarly C2 and C3
c1(y)=(4+6+7)/3=5.66
Clustering – solution
[Grid from the worked example: each cell of the table is labelled with the cluster, C1, C2 or C3, that the corresponding data point is finally assigned to.]
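A compact sketch of the whole k-means loop (NumPy; initialising by picking k random points is one common choice, assumed here):

    import numpy as np

    def kmeans(X, k=3, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial centres
        for _ in range(iters):
            # assign each point to its nearest centroid (Euclidean distance)
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            labels = np.argmin(d, axis=1)
            # recompute each centroid as the average of the points in its cluster
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return centroids, labels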
• A lower-space vector requires less storage space, so the data is compressed. Due to the
density matching property of vector quantization, the compressed data has errors that are
inversely proportional to density.
• A vector quantizer maps k-dimensional vectors in the vector space Rk into a finite set of
vectors Y = {yi : i = 1, 2, ..., N}. Each vector yi is called a code vector or a codeword, and the set of all the codewords is called a codebook. Associated with each codeword yi is a nearest-neighbour region called a Voronoi region, defined by:
A related application is data compression, which is used both for storing data and for transmitting speech and image data. The applications are related because both replace the current input by the cluster centre it belongs to.
For noise reduction we do this to replace the noisy input with a cleaner one, while for data compression we do it to reduce the number of data points that we send.
Instead of transmitting the actual data, we can transmit the index of that data point in the codebook, which is shorter; each sound and image compression algorithm has its own method of solving this.
One problem is that the codebook won't contain every possible datapoint. What if a datapoint isn't in the codebook? In that case we need to accept that our data will not look exactly the same, and we send the index of the prototype vector that is closest to it (this is known as vector quantisation, and is the way that lossy compression works).
The set of points mapped to a particular prototype is the Voronoi set of that prototype. Together, they produce the Voronoi tessellation of the space. If we connect together every pair of points that share an edge, as shown by the dotted lines, we get the Delaunay triangulation, which is the optimal way to organise the space to perform function approximation.
We want to choose prototype vectors that are as close as possible to all of the possible inputs. This application is called learning vector quantisation because we are learning an efficient vector quantisation. The k-means algorithm can be used to solve the problem.
• SOM is a technique which reduce the dimensions of data through the use of self-organizing neural
networks.
• The model was first described as an artificial neural network by professor Teuvo Kohonen.
Unsupervised learning
• Unsupervised learning is a class of problems in which one seeks to determine how the data are
organized.
How could we know what constitutes "different" clusters? Consider the green apple and banana example, with two features: shape and colour.
Competitive learning
The position of the unit for each data point can be expressed as follows:
Competitive learning is useful for clustering of input patterns into a discrete set of output clusters.
Algorithm
1. Initialise the weight vectors of the nodes (e.g., randomly).
2. Choose an input vector D(t) from the training data.
3. Find the best matching unit (BMU): the node whose weight vector is closest to the input vector.
4. Update the nodes in the neighbourhood of the BMU by pulling them closer to the input vector:
   Wv(t + 1) = Wv(t) + Θ(t) α(t) (D(t) − Wv(t))
5. Increment t and repeat from step 2 while t < λ.
Application
• In machine learning classification problems, there are often too many factors on the basis
of which the final classification is done. These factors are basically variables called
features.
• The higher the number of features, the harder it gets to visualize the training set and then
work on it. Sometimes, most of these features are correlated, and hence redundant. This is
where dimensionality reduction algorithms come into play.
• Feature selection: In this, we try to find a subset of the original set of variables, or
features, to get a smaller subset which can be used to model the problem. It usually
involves three ways:
– Filter
– Wrapper
– Embedded
• Feature extraction: This reduces the data in a high dimensional space to a lower
dimension space, i.e. a space with lesser no. of dimensions.
Goals
Dimensionality Reduction
• It ensures that the converted data set conveys similar information concisely.
Example-
• We convert the dimensions of data from 2 dimensions (x1 and x2) to 1 dimension (z1).
Benefits-
• It compresses the data and thus reduces the storage space requirements.
• It reduces the time required for computation since less dimensions require less
computation.
• This transformation is defined in such a way that the first principal component has the
largest possible variance and each succeeding component in turn has the next highest
possible variance.
PCA Approach
• Transform the original dataset via projection matrix to obtain a k-dimensional feature
subspace.
• It transforms the variables into a new set of variables called as principal components.
• These principal components are linear combination of original variables and are
orthogonal.
• The first principal component accounts for most of the possible variation in the original data.
• The second principal component captures as much of the remaining variance as possible.
There can be only two principal components for a two-dimensional data set.
Limitation of PCA
Applications of PCA :
• Neuroscience
PCA algorithm steps:
Step 1: Standardization of the data- is all about scaling your data in such a way that all the
variables and their values lie within a similar range
PCA Algorithm-
Step-05: Calculate the eigen vectors and eigen values of the covariance matrix.
Given data = { 2, 3, 4, 5, 6, 7 ; 1, 5, 3, 6, 7, 8 }.
OR
Consider the two dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8).
CLASS 1
X=2,3,4
Y=1,5,3
CLASS 2
X=5,6,7
Y=6,7,8
Solution-
Step-01:
Get data.
x1 = (2, 1)
x2 = (3, 5)
x3 = (4, 3)
x4 = (5, 6)
x5 = (6, 7)
x6 = (7, 8)
Step-02:
Calculate the mean vector µ of the data: µ = ((2+3+4+5+6+7)/6, (1+5+3+6+7+8)/6) = (4.5, 5).
Step-03:
Subtract the mean from the given data:
x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)
Step-04:
Calculate the covariance matrix M = (1/6) Σ (xi – µ)(xi – µ)ᵀ = [[2.92, 3.67], [3.67, 5.67]].
Solving the characteristic equation det(M – λI) = 0 gives
λ² – 8.59λ + 3.09 = 0,
so λ1 ≈ 8.21 and λ2 ≈ 0.38.
Clearly, the second eigenvalue is very small compared to the first.
The eigenvector corresponding to the greatest eigenvalue is the principal component for the given data set, so we find the eigenvector corresponding to the eigenvalue λ1.
MX = λX
On simplification, we get-
5.3X1 = 3.67X2 ………(1)
Lastly, we project the data points onto the new subspace:
Example
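The worked example above can be checked numerically; a sketch with NumPy (dividing by n = 6, as the example does):

    import numpy as np

    X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)

    mu = X.mean(axis=0)              # (4.5, 5)
    D = X - mu                       # mean-centred data
    M = (D.T @ D) / len(X)           # covariance matrix [[2.92, 3.67], [3.67, 5.67]]

    vals, vecs = np.linalg.eigh(M)   # eigenvalues ~0.38 and ~8.21
    pc = vecs[:, np.argmax(vals)]    # eigenvector of the largest eigenvalue
    projected = D @ pc               # data projected onto the new subspace
    print(vals, pc, projected)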
4.2 Factor Analysis – Independent Component Analysis – Locally Linear Embedding –
Isomap
✓ FA is a correlational method used to find and describe the underlying factors driving data
values for a large set of variables.
✓ FA identifies correlations between and among variables to bind them into one underlying
factor driving their values
✓ FA examines the interrelationships among a large number of variables and then attempts
to explain them in terms of their common underlying dimension
✓ Interdependence techniques
Assumptions of FA
• No outliers: no extreme data are present in the data set. For example, consider the following values: 1, 5, 8, -4, 23, 18 and 1,247,942 — the last is an outlier.
• Adequate sample size: you must have more variables than you have factors. You cannot have only 3 variables and 4 factors. Each variable must also have more data values than you have factors; our data sets will be large.
• No perfect multicollinearity: each variable is unique, i.e., no variable is an exact combination of the others.
• Homoscedasticity: all variables have the same finite variance. In other words, the curves do not have to possess the same size standard deviations.
• Linearity of variables
• Linearity of variables
• Unlike directly measured variables such as speed, height, weight, etc., some variables such as creativity, happiness, religiosity and comfort are not a single measurable entity.
• They are constructs that are derived from the measurement of other, directly observable variables.
• Factor analysis is a correlational method used to find and describe the underlying factors
driving data values for a large set of variables
• It identifies correlations between and among variables to bind them into one underlying
factor driving their values.
The problem of factor analysis is to find those independent factors, and the noise that is
inherent in the measurements of each factor. Factor analysis is commonly used in psychology
and other social sciences
Uses of FA
• Exploratory modelling
• Confirmatory modelling
• ICA is a computational technique used for extracting the source signals from an observed mixture of multivariate signals.
• The ICA model: the observed mixture is x = As, where s is the vector of independent source signals and A the (unknown) mixing matrix; ICA estimates an unmixing matrix W so that s ≈ Wx.
• ICA is used for source signals separation in many applications including medical data,
audio signals, or optical imaging. Data can be in the form of images, sounds or other time
series data
• ICA can be applied as a dimensionality reduction algorithm because after extracting the
source signals, the unnecessary signals can be removed or deleted.
• ICA is considered as an extension of the PCA. But PCA tries to find the axis that
maximizes the variance using second order statistics, while ICA tries to maximize the
independence between source signals using the higher order statistics
• In PCA, the components were orthogonal and uncorrelated (so that the covariance matrix was diagonal, i.e., cov(bi, bj) = 0).
• If we further require that the components are statistically independent (so that E[bi bj] = E[bi]E[bj], as well as the bi being uncorrelated), then we get ICA. The common motivation for ICA is the problem of blind source separation (blind signal separation (BSS) aims at recovering unknown source signals from the observed sensor signals, where the mixing process is also unknown).
• The most popular way to describe blind source separation is known as the cocktail party
problem.
The cocktail party problem is the challenge of separating out the many different sounds you hear, coming from many different locations (different people talking, the clink of glasses, background music, etc.). The objective is to detect the speech of different people when a group of people are talking simultaneously.
– Independence: the source signals are independent; however, their mixtures are not, as they share the same source signals.
– Non-normality: independent signals should come from non-Gaussian distributions; otherwise, ICA cannot be applied to extract the source signals.
– Complexity: the complexity of the mixed signals is greater than that of their components.
• At a cocktail party, there are lots of sounds going on around you:
– Background music
Our brains do a really good job of filtering out the background noise and focusing on just one sound — but how would a computer do this?
Assumptions
• Non-Gaussian sources
• The first approach tries to approximate the data by sticking together sets of locally flat patches that cover the dataset, while the second uses the shortest distances (geodesics) on the non-linear space to find a globally optimal solution.
• The locally linear algorithm is called Locally Linear Embedding (LLE); it was introduced by Roweis and Saul in 2000. The idea is that making linear approximations will introduce some errors, so we should make these errors as small as possible by making the patches small where there is a lot of non-linearity in the data. The error is known as the reconstruction error, and is simply the sum-of-squares of the distances between each original point and its reconstruction from its neighbours:

    ε(W) = Σi | xi − Σj Wij xj |²
– Points that are less than some predefined distance d from the current point are neighbours (so we don't know how many neighbours there are, but they are all close).
– The k nearest points are neighbours (so we know how many there are, but some could be far away).
The diagram above shows the Locally Linear Embedding (LLE) algorithm with k = 12 neighbours applied to the iris dataset. LLE produces a very interesting result: it separates the three groups into three points, separating the data perfectly.
ISOMAP
• Isomap first builds a graph connecting neighbouring points; after that, it uses the graph distance to approximate the geodesic distance between all pairs of points.
• Like PCA, MDS tries to find a linear approximation to the full dataspace that embeds the
data into a lower dimensionality.
• In the case of MDS the embedding tries to preserve the distances between all pairs of
points
4.3 Evolutionary Learning-Genetic algorithms - Genetic Offspring: - Genetic Operators-
Using Genetic Algorithms
• Genetic Algorithms use concepts from evolutionary biology (natural genetics and natural selection).
• Genetic Algorithms are heuristic search and optimisation techniques that mimic the process of natural evolution.
Applications
• DNA analysis
• Robotics
• Game playing
• Business
• Machine learning
• Image processing
• Vehicle Routing
• Neural Network
Genetic Algorithm
• At the end of a run often there is at least one highly fit chromosome in the population
Basic Terminology of GA
• Genotype example:
o B – black hair gene
o b – brown hair gene
o Bb, bB, BB, bb – genotypes
• Phenotype example:
o BB, bB – black hair
o Bb – brown hair
The process that determines which solutions are to be preserved and allowed to reproduce and
which ones deserve to die out.
The primary objective of the selection operator is to emphasize the good solutions and eliminate
the bad solutions in a population while keeping the population size constant.
There are several commonly used selection schemes. They are:
o Tournament selection
o Proportionate selection
o Rank selection
Tournament selection
□ In tournament selection several tournaments are played among a few individuals. The
individuals are chosen at random from the population.
Fitness function
• A fitness function value quantifies the optimality of a solution. The value is used to rank
a particular solution against all the other solutions
• A fitness value is assigned to each solution depending on how close it is actually to the
optimal solution of the problem
Crossover operator
• The most popular crossover selects any two solution strings randomly from the mating pool, and some portion of the strings is exchanged between them.
Binary Crossover
Mutation operator
• Mutation is the occasional introduction of new features in to the solution strings of the
population pool to maintain diversity in the population.
• Though crossover has the main responsibility to search for the optimal solution, mutation
is also used for this purpose.
Binary Mutation
• A high value of mutation probability would search here and there like a random search
technique
Example
Maximise the function f(x) = x² over the range of integers from 0 to 31.
The function f(x) is simple, so it is easy to use the f(x) value itself to rate the fitness of a solution; otherwise we might have considered a simpler heuristic that would more or less serve the same purpose.
GAs often process binary representations of solutions. This works well, because crossover and mutation can be clearly defined for binary solutions. A binary string of length 5 can represent 32 numbers (0 to 31).
Here we consider a population of four solutions; however, large populations are used in real applications to explore a larger part of the search space. Assume four randomly generated solutions: 01101, 11000, 01000, 10011. These are the chromosomes or genotypes.
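A toy run of the whole loop for this example (Python; the tournament selection and the mutation rate are illustrative choices, not fixed by the text):

    import random

    def fitness(chrom):                  # f(x) = x^2; chromosome is a 5-bit string
        return int(chrom, 2) ** 2

    def select(pop):                     # tournament between two random individuals
        a, b = random.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    def crossover(p1, p2):               # single-point crossover
        pt = random.randint(1, 4)
        return p1[:pt] + p2[pt:], p2[:pt] + p1[pt:]

    def mutate(chrom, p=0.05):           # occasional bit flips maintain diversity
        return "".join(c if random.random() > p else str(1 - int(c)) for c in chrom)

    pop = ["01101", "11000", "01000", "10011"]   # the four solutions assumed above
    for _ in range(20):
        nxt = []
        while len(nxt) < len(pop):
            c1, c2 = crossover(select(pop), select(pop))
            nxt += [mutate(c1), mutate(c2)]
        pop = nxt
    print(max(pop, key=fitness))         # tends towards 11111, i.e. x = 31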
Cross over operator
Mutation operator
Advantages of Genetic Algorithm
• Does not require any derivative information which may not be available for many real
world problems
• Optimizes both continuous and discrete functions and also multi objective problems
• Provides a list of ‘good’ solutions and not just a single solution. Always gets an answer
which gets better over the time.
• Useful when the search space is very large and there are a number of parameters
involved.
Reinforcement learning
Every action has some impact on the environment, and the environment provides rewards that
guide the learning algorithm.
Reinforcement Learning
Step 1: The learner tries an action (Action A) and the environment returns a reward.
Step 2: The learner tries another action (Action B) and again receives a reward.
Step 3: ... (the trial-and-error loop continues)
• Meaning of Reinforcement:
• Reinforcement learning is the problem faced by an agent that learns behavior through
trial-and-error interactions with a dynamic environment.
Examples of RL
Agent–Environment Interface
1. The agent (the RL algorithm that learns from trial and error) observes an input state.
2. An action is determined by a decision-making function (the policy).
3. The action is performed; an action is any of the possible steps that the agent can take.
4. The agent receives a scalar reward or reinforcement from the environment (the world
through which the agent moves).
5. Information about the reward given for that state/action pair is recorded.
Markov Decision Process (MDP): Reinforcement learning reasons from knowledge of the
current state to predict the next state in an optimal way. Under an MDP, the agent proposes
the direction of this movement towards the optimal solution.
Q-learning is a model-free reinforcement learning algorithm that learns the value of an action in a
particular state.
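A minimal sketch of tabular Q-learning. The ToyEnv class is a hypothetical environment invented for illustration (a 4-state chain with reset() and step() methods); the update rule implemented is Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',·) − Q(s,a)].

import random
import numpy as np

class ToyEnv:
    # Hypothetical 4-state chain: action 1 moves right, action 0 stays put.
    # Reaching state 3 yields reward 1 and ends the episode.
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + a, 3)
        reward = 1.0 if self.s == 3 else 0.0
        return self.s, reward, self.s == 3

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore occasionally, otherwise act greedily.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)
            # Model-free update towards reward plus discounted best next value.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2
    return Q

print(q_learning(ToyEnv(), n_states=4, n_actions=2))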
UNIT - V GRAPHICAL MODELS
Markov Chain Monte Carlo Methods – Sampling – Proposal Distribution – Markov Chain
Monte Carlo – Graphical Models – Bayesian Networks – Markov Random Fields – Hidden
Markov Models – Tracking Methods.
• Monte Carlo sampling is not effective and may be intractable for high-dimensional probabilistic
models.
• Markov Chain Monte Carlo provides an alternative approach to random sampling from a high-
dimensional probability distribution, where the next sample is dependent upon the current
sample.
• Gibbs Sampling and the more general Metropolis-Hastings algorithm are the two most common
approaches to Markov Chain Monte Carlo sampling.
• Markov Chain Monte Carlo is a method to sample from a population with a complicated
probability distribution.
• Sample - A subset of data drawn from a larger population. (Also used as a verb, to sample, i.e. the
act of selecting that subset; also, reusing a small piece of one song in another song, which is not
so different from the statistical practice, but is more likely to lead to lawsuits.) Sampling permits
us to approximate data without exhaustively analyzing all of it, because some datasets are too
large or complex to compute. We are often stuck behind a veil of ignorance, unable to gauge reality
around us with much precision. So we sample.
• Population - The set of all things we want to know about; e.g. coin flips, whose outcomes we want
to predict. Populations are often too large for us to study them in toto, so we sample. For example,
humans will never have a record of the outcome of all coin flips since the dawn of time. It’s
physically impossible to collect, inefficient to compute, and politically unlikely to be allowed.
Gathering information is expensive. So in the name of efficiency, we select subsets of the
population and pretend they represent the whole. Flipping a coin 100 times would be a sample
of the population of all coin tosses and would allow us to reason inductively about all the coin
flips we cannot see.
• Distribution (or probability distribution) - You can think of a distribution as a table that links
outcomes with probabilities. A coin toss has two possible outcomes, heads (H) or tails (T). Flipping
it twice can result in either HH, TT, HT or TH. So let’s construct a table that shows the outcomes
of two-coin tosses as measured by the number of H that result:
Number of heads | Outcomes | Probability
0 | TT | 1/4
1 | HT, TH | 1/2
2 | HH | 1/4
• Markov Chain Monte Carlo (MCMC) is a mathematical method that draws samples randomly from
a black box to approximate the probability distribution of attributes over a range of objects or
future states. You could say it’s a large-scale statistical method for guess-and-check. MCMC
methods help gauge the distribution of an outcome or statistic you’re trying to predict, by
randomly sampling from a complex probabilistic space. As with all statistical techniques, we
sample from a distribution when we don’t know a function that succinctly describes the relation
between the variables of interest (e.g. actions and rewards). MCMC helps us approximate a
black-box probability distribution.
• The Markov property says that, given a process which is at state Xn at a particular point of time,
the probability of Xn+1 = k, where k is any of the M states the process can jump to, will only
depend on the state it is in at the given moment, and not on how it reached the current state.
Mathematically speaking:
P(Xn+1 = k | Xn = xn, Xn−1 = xn−1, ..., X1 = x1) = P(Xn+1 = k | Xn = xn)
• Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a
probability distribution based on constructing a Markov chain that has the desired distribution
as its stationary distribution. The state of the chain after a number of steps is then used as a
sample of the desired distribution. The quality of the sample improves as a function of the number
of steps.
• MCMC methods make life easier for us by providing algorithms that can create a Markov
Chain which has, say, the Beta distribution as its stationary distribution, given that we can sample
from a uniform distribution (which is relatively easy).
MARKOV IDEA
• Markov property: p(x^(i) | x^(i−1), ..., x^(1)) = T(x^(i) | x^(i−1))
• The transition distribution T is chosen such that, when simulating a trajectory of states from it,
the chain explores the state space, spending more time in the most important regions (i.e. where
p(x) is large).
• Like other MCMC methods, the Gibbs sampler constructs a Markov Chain whose values converge
towards a target distribution. Gibbs Sampling is in fact a specific case of the Metropolis-
Hastings algorithm wherein proposals are always accepted.
Let’s take a look at an example. Suppose we had the following posterior and conditional probability
distributions, where g(y) contains the terms that don’t include x, and g(x) contains those that don’t
depend on y. We don’t know the value of C (the normalizing constant), but we do know the
conditional probability distributions. Therefore, we can use Gibbs Sampling to approximate the
posterior distribution.
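Since the example's exact posterior is not reproduced above, the following sketch uses a standard bivariate normal with correlation rho as a stand-in target: its two full conditionals are known 1-D normals, so each can be sampled exactly in turn, which is precisely the Gibbs scheme.

import numpy as np

def gibbs_bivariate_normal(n_samples=5000, rho=0.8, burn_in=500):
    # For a standard bivariate normal with correlation rho,
    # x | y ~ N(rho*y, 1 - rho^2) and y | x ~ N(rho*x, 1 - rho^2).
    x, y = 0.0, 0.0
    samples = []
    for i in range(n_samples + burn_in):
        x = np.random.normal(rho * y, np.sqrt(1 - rho**2))  # sample x | y
        y = np.random.normal(rho * x, np.sqrt(1 - rho**2))  # sample y | x
        if i >= burn_in:           # discard early samples before convergence
            samples.append((x, y))
    return np.array(samples)

samples = gibbs_bivariate_normal()
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])  # ~[0, 0] and ~0.8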
METROPOLIS–HASTINGS ALGORITHM:-
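A minimal random-walk Metropolis sketch (the symmetric-proposal special case of Metropolis-Hastings, in which the acceptance ratio reduces to p(x')/p(x)). The target here is an illustrative unnormalized standard normal supplied as a log-density; only an unnormalized density is needed.

import numpy as np

def metropolis_hastings(log_p, n_samples=10000, step=1.0, x0=0.0):
    # Propose x' = x + N(0, step^2); accept with probability min(1, p(x')/p(x)).
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = x + np.random.normal(0.0, step)
        if np.log(np.random.rand()) < log_p(x_new) - log_p(x):
            x = x_new              # accept the proposal
        samples.append(x)          # on rejection, the current state repeats
    return np.array(samples)

# Example target: unnormalized standard normal, log p(x) = -x^2 / 2.
samples = metropolis_hastings(lambda x: -0.5 * x * x)
print(samples.mean(), samples.std())   # approximately 0 and 1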
SAMPLING:-
Statistical sampling is a large field of study, but in applied machine learning, there may be three types of
sampling that you are likely to use: simple random sampling, systematic sampling, and stratified sampling.
Simple Random Sampling: Samples are drawn with a uniform probability from the domain.
Systematic Sampling: Samples are drawn using a pre-specified pattern, such as at intervals.
Stratified Sampling: Samples are drawn within pre-specified categories (i.e. strata).
Although these are the more common types of sampling that you may encounter, there are other
techniques.
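A toy sketch of the three sampling types using Python's standard library; the population of 100 integers and the parity-based strata are assumptions made for illustration only.

import random

population = list(range(100))          # a toy population of 100 items

# Simple random sampling: uniform probability, without replacement.
simple = random.sample(population, 10)

# Systematic sampling: every k-th element from a random starting offset.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample within pre-specified strata (here: parity).
strata = {"even": [x for x in population if x % 2 == 0],
          "odd":  [x for x in population if x % 2 == 1]}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

print(simple, systematic, stratified, sep="\n")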
PROPORTIONAL SAMPLING
A walk-through of the concept of proportional sampling, with an example explanation and Python
code to perform it.
What is proportional sampling?
In the simplest words, proportional sampling is a sampling of a population in which the probability of
picking an element is proportional to some common shared attribute or property of all the elements in
the population. For example, suppose you have a set of numbers, say {2,5,8,15,46,90}, and you want to
randomly pick a number but you don’t want the probability to be uniform. Instead, you want the
probability of finding a number to be proportional to the face values of the number which precisely means
that 90 should have the highest probability and 2 should have the lowest probability to be picked up.
EXAMPLE:
In Europe, a new football tournament was announced, and many big businessmen came forward to start
their own clubs. One owner, Mr. Robert, had no knowledge of how to choose the right players for his
team, but he was quite certain that the greater the number of goals a player has scored, the better the
player. With this much knowledge, he arranged a player-vs-number-of-goals dataset.
Each team has to select 18 players. The rule was that Mr. Robert could take 4 players of his choice and
had to select the other 14 randomly. So, basically, Mr. Robert has to select 14 random players from the
table such that the probability of selecting a player is higher if the player's number of goals is higher.
Normalize the goal counts and form their cumulative sums G''. Then draw a random number r uniformly
from [0, 1] and, scanning the list of cumulative normalized values, return the player corresponding to
the first G''_i with r ≤ G''_i.
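A minimal sketch of this cumulative-normalization procedure, using the number set {2, 5, 8, 15, 46, 90} from above as hypothetical goal counts for players A–F (the player names are invented for illustration).

import random

goals = {"A": 2, "B": 5, "C": 8, "D": 15, "E": 46, "F": 90}  # hypothetical players

def proportional_pick(weights):
    # Normalize, form cumulative sums, draw r ~ U(0,1), and return the first
    # element whose cumulative value reaches r. Players with more goals occupy
    # a wider slice of [0, 1] and are therefore picked more often.
    total = sum(weights.values())
    r = random.random()
    cumulative = 0.0
    for player, g in weights.items():
        cumulative += g / total
        if r <= cumulative:
            return player
    return player  # guard against floating-point rounding at the top end

picks = [proportional_pick(goals) for _ in range(10000)]
print({p: picks.count(p) for p in goals})   # counts roughly proportional to goals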
• Framework for modeling and efficiently reasoning about multiple correlated random variables
• Traditional learning algorithms assume – data available in record format – instances are i.i.d.
samples
• Recent domains like Web, Biology, Marketing have more richly structured data
• Examples : DNA Sequences, Social Networks, Hyperlink structure of Web, Phylogeny Trees
The nature of the problems we are generally interested in solving, and the types of queries we want to
make, are all probabilistic because of uncertainty. There are many reasons that contribute to this.
Graphical
Graphical representation helps us to visualize better, so we use Graph Theory to reduce the number of
relevant combinations of all the participating variables and thus represent the high-dimensional
probability distribution model more compactly.
Models
• Directed graphical models specify a factorization of the joint distribution over a set of variables
into a product of local conditional distributions
• The second major class of graphical models is described by undirected graphs (Markov random
fields) and again specifies how the joint distribution factorizes. Inference in such models supports
applications such as:
• Image de‐noising
• Image de‐blurring
• Image segmentation
• Image super‐resolution
Hidden Markov Model
The Hidden Markov Model (HMM) is a relatively simple way to model sequential data.
A hidden Markov model implies that the Markov Model underlying the data is hidden or unknown
to you. More specifically, you only know observational data and not information about the states.
In other words, there’s a specific type of model that produces the data (a Markov Model) but you
don’t know what processes are producing it. You basically use your knowledge of Markov Models
to make an educated guess about the model’s structure.
HMM states (X), observations (O) and probabilities (A, B). Source: Stamp 2018
Consider weather, stock prices, DNA sequence, human speech or words in a sentence. In all these
cases, current state is influenced by one or more previous states. Moreover, often we can observe
the effect but not the underlying cause that remains hidden from the observer. Hidden Markov
Model (HMM) helps us figure out the most probable hidden state given an observation.
In practice, we use a sequence of observations to estimate the sequence of hidden states. In HMM,
the next state depends only on the current state. As such, it's good for modelling time series
data. We can classify HMM as a generative probabilistic model, since a sequence of observed
variables is generated by a sequence of hidden states. HMM is also seen as a specific kind
of Bayesian network.
Discussion
Suppose Bob tells his friend Alice what he did earlier today. Based on this information Alice
guesses today's weather at Bob's location. In HMM, we model weather as states and Bob's activity
as observations.
• Transition Probabilities: Probability of moving from one state to another. For example, "If today was
sunny, what's the probability that it will rain tomorrow?" If there are N states, this is an NxN matrix.
• Emission Probabilities: Probability of a particular output given a particular state. For example, "What's
the chance that Bob is walking if it's raining?" Given a choice of M possible observation symbols, this is an
NxM matrix. This is also called output or observation probabilities.
• Initial Probabilities: Probability of being in a state at the start, say, yesterday or ten days ago.
Unlike a typical Markov chain, we can't see the states in HMM. However, we can observe the
output and then predict the state. Thus, the states are hidden, giving rise to the term "hidden" in
the name HMM.
Typical notation used in HMM. Source: Kang 2017.
Let A, B and π denote the transition matrix, observation matrix and initial state distribution
respectively. HMM can be represented as λ = (A, B, π). Let observation sequence be O and state
sequence be Q.
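To make the notation concrete, here is an illustrative λ = (A, B, π) for the Alice-and-Bob discussion, with two hidden weather states and three observable activities; all of the probability values below are assumptions chosen for illustration only.

import numpy as np

states = ["Rainy", "Sunny"]                  # hidden states (illustrative)
observations = ["walk", "shop", "clean"]     # Bob's observable activities

A  = np.array([[0.7, 0.3],                   # transition matrix: N x N
               [0.4, 0.6]])
B  = np.array([[0.1, 0.4, 0.5],              # emission matrix: N x M
               [0.6, 0.3, 0.1]])
pi = np.array([0.6, 0.4])                    # initial state distribution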
• Likelihood Problem: Given O and λ, find the likelihood P(O|λ). How likely is a particular sequence of
observations? Forward algorithm solves this problem.
• Decoding Problem: Given O and λ, find the best possible Q that explains O. Given the observation
sequence, what's the best possible state sequence? Viterbi algorithm solves this problem.
• Learning Problem: Given O and Q, learn λ, perhaps by maximizing P(O|λ). What model best maps states
to observations? Baum-Welch algorithm, also called forward-backward algorithm, solves this problem. In
the language of machine learning, we can say that O is training data and the number of states N is the
model's hyperparameter.
HMM has been applied in many areas including automatic speech recognition, handwriting
recognition, gesture recognition, part-of-speech tagging, musical score following, partial
discharges and bioinformatics. In speech recognition, a spectral analysis of speech gives us suitable
observations for HMM. States are modelled after phonemes or syllables, or after the average
number of observations in a spoken word. Each word gets its own model. To tag words with their
parts of speech, the tags are modelled as hidden states and the words are the observations. In
computer networking, HMMs are used in intrusion detection systems. This has two flavours:
anomaly detection in which normal behaviour is modelled; or misuse detection in which a
predefined set of attacks is modelled. In computer vision, HMM has been used to label human
activities from skeleton output. Each activity is modelled with a HMM. By linking
multiple HMMs on common states, a compound HMM is formed. The purpose is to allow robots
to be aware of human activity.
What are the different types of Hidden Markov Models?
In the typical model, called the ergodic HMM, the states of the HMM are fully connected so that
we can transition to a state from any other state. Left-right HMM is a more constrained model in
which state transitions are allowed only from lower indexed states to higher indexed ones.
Variations and combinations of these two types are possible, such as having two parallel left-to-
right state paths. HMM started with observations of discrete symbols governed
by discrete probabilities. If observations are continuous signals, then we would
use continuous observation density.
There are also domain-specific variations of HMM. For example, in biological sequence analysis,
there are at least three types including profile-HMMs, pair-HMMs, and context-sensitive HMMs.
Trellis diagrams showing forward and backward algorithms. Source: Adapted from Jana 2019b.
Every state sequence has a probability that it will lead to a given sequence of observations. Given
T observations and N states, there are N^T possible state sequences. Thus, the complexity of
directly calculating the probability of a given sequence of observations is O(N^T · T). Both the
forward and backward algorithms bring the complexity down to O(N^2 · T) through dynamic
programming.
In the forward algorithm, we consider the probability of being in a state at the current time step.
Then we consider the transition probabilities to calculate the state probabilities for the next step.
Thus, at each time step we have considered all state sequences preceding it. The algorithm is more
efficient since it reuses calculations from earlier steps. Instead of keeping all path sequences, paths
are folded into a forward trellis. Backward algorithm is similar except that we start from the last
time step and calculate in reverse. We're finding the probability that from a given state, the model
will generate the output sequence that follows.
A combination of both algorithms, called forward-backward algorithm, is used to solve the
learning problem.
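A minimal forward-algorithm sketch, reusing the illustrative A, B and pi defined in the notation example above; it computes the likelihood P(O|λ) in O(N²T) by folding all paths into a trellis.

import numpy as np

def forward(obs_seq, A, B, pi):
    # alpha[t, i] = P(o_1 .. o_t, state_t = i | model), filled in by DP.
    T_len, N = len(obs_seq), len(pi)
    alpha = np.zeros((T_len, N))
    alpha[0] = pi * B[:, obs_seq[0]]
    for t in range(1, T_len):
        # Sum over predecessor states instead of enumerating all paths.
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs_seq[t]]
    return alpha[-1].sum()       # likelihood P(O | lambda)

obs = [0, 2, 1]                  # e.g. walk, clean, shop as column indices of B
print(forward(obs, A, B, pi))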
Viterbi algorithm solves HMM's decoding problem. It's similar to the forward algorithm except
that instead of summing the probabilities of all paths leading to a state, we retain only one path
that gives maximum probability. Thus, at every time step or iteration, given that we have N states,
we retain only N paths, the most likely path for each state. For the next iteration, we use the most
likely paths of current iteration and repeat the process.
When we reach the end of the sequence, we'll have N most likely paths, each ending in a unique
state. We then select the most likely end state. Once this selection is made, we backtrack to read
the state sequence, that is, how we got to the end state. This state sequence is now the most likely
sequence given our sequence of observations.
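A matching Viterbi sketch under the same illustrative model: instead of summing path probabilities, it keeps only the most likely path into each state at every time step, then backtracks from the most likely end state.

import numpy as np

def viterbi(obs_seq, A, B, pi):
    # delta[t, i] = max probability of any state path ending in state i at time t;
    # psi stores the argmax predecessors so the best path can be backtracked.
    T_len, N = len(obs_seq), len(pi)
    delta = np.zeros((T_len, N))
    psi = np.zeros((T_len, N), dtype=int)
    delta[0] = pi * B[:, obs_seq[0]]
    for t in range(1, T_len):
        trans = delta[t - 1][:, None] * A       # entry (i, j): best-so-far via i -> j
        psi[t] = trans.argmax(axis=0)           # best predecessor for each state j
        delta[t] = trans.max(axis=0) * B[:, obs_seq[t]]
    # Select the most likely end state, then backtrack through psi.
    path = [int(delta[-1].argmax())]
    for t in range(T_len - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

obs = [0, 2, 1]
print(viterbi(obs, A, B, pi))    # most likely hidden state sequence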
In HMM's learning problem, we are required to learn the transition (A) and observation (B)
probabilities when given a sequence of observations and the vocabulary of hidden states.
The forward-backward algorithm solves this problem. It's an iterative algorithm. It starts with
an initial estimate of the probabilities and improves these estimates with each iteration.
• Expectation or E-step: We compute the expected state occupancy count and the expected state transition
count based on current probabilities A and B.
• Maximization or M-step: We use the expected counts from the E-step to recompute A and B.
While this algorithm is unsupervised, in practice, initial conditions are very important. For this
reason, often extra information is given to the algorithm. For example, in speech recognition,
the HMM structure is set manually and the model is trained to set the initial probabilities.
What is Object Tracking?
Object tracking is an application of deep learning where the program takes an initial set of object
detections and develops a unique identification for each of the initial detections and then tracks
the detected objects as they move around frames in a video. In other words, object tracking is the
task of automatically identifying objects in a video and interpreting them as a set of trajectories
with high accuracy. Often, there’s an indication around the object being tracked, for example, a
surrounding square that follows the object, showing the user where the object is on the screen.
Object tracking is used for a variety of use cases involving different types of input footage.
Whether or not the anticipated input will be an image or a video, or a real-time video vs. a
prerecorded video, impacts the algorithms used for creating object tracking applications. The kind
of input also impacts the category, use cases, and applications of object tracking. Here, we will
briefly describe a few popular uses and types of object tracking, such as video tracking, visual
tracking, and image tracking.
Video Tracking
Video tracking is an application of object tracking where moving objects are located within video
information. Hence, video tracking systems are able to process live, real-time footage and also
recorded video files. The processes used to execute video tracking tasks differ based on which type
of video input is targeted. This will be discussed more in-depth when we compare batch and online
tracking methods later in this article.
Visual Tracking
Visual tracking, or visual target-tracking, is a research topic in computer vision that is applied in
a large range of everyday scenarios. The goal of visual tracking is to estimate the future position
of a visual target that was initialized without the availability of the rest of the video.
Image Tracking
Image tracking is meant for detecting two-dimensional images of interest in a given input. That
image is then continuously tracked as it moves in the setting. Image tracking is ideal for datasets
with highly contrasting images (ex. black and white), asymmetry, few patterns, and multiple
identifiable differences between the image of interest and other images in the image set. Image
tracking relies on computer vision to detect and augment images after image targets are
predetermined.
Modern object tracking methods can be applied to real-time video streams of basically any camera.
Therefore, the video feed of a USB camera or an IP camera can be used to perform object tracking,
by feeding the individual frames to a tracking algorithm. Frame skipping or parallelized processing
are common methods to improve object tracking performance with real-time video feeds of one or
multiple cameras.
The main challenges usually stem from issues in the image that make it difficult for object tracking
models to perform accurate detections. Here, we will discuss the most common issues with the
task of tracking objects and methods of preventing or dealing with these challenges.
1. Tracking Speed
Algorithms for tracking objects are supposed to not only accurately perform detections and localize
objects of interest but also do so in the least amount of time possible. Enhancing tracking speed is
especially imperative for real-time object tracking models. To manage the time taken for a model
to perform, the algorithm used to create the object tracking model needs to be either customized
or chosen carefully. Fast R-CNN and Faster R-CNN can be used to increase the speed of the most
common R-CNN approach. Since CNNs (Convolutional Neural Networks) are commonly used for
object detection, CNN modifications can be the differentiating factor between a faster object
tracking model and a slower one. Design choices besides the detection framework also influence
the balance between speed and accuracy of an object detection model.
2. Background Distractions
The backgrounds of inputted images or images used to train object tracking models also impact
the accuracy of the model. Busy backgrounds of objects meant to be tracked can make it harder
for small objects to be detected. With a blurry or single-color background, it is easier for an AI
system to detect and track objects. Backgrounds that are too busy, have the same color as the
object, or that are too cluttered can make it hard to track results for a small object or a lightly
colored object.
3. Varying Spatial Scales
Objects meant to be tracked can come in a variety of sizes and aspect ratios. These ratios can
confuse the object tracking algorithm into believing that objects are scaled larger or smaller than
their actual size. Such size misconceptions can negatively impact detections or detection speed. To
combat the issue of varying spatial scales, programmers can implement techniques such as feature
maps, anchor boxes, image pyramids, and feature pyramids.
Anchor Boxes: Anchor boxes are a compilation of bounding boxes that have a specified height
and width. The boxes are meant to acquire the scale and aspect ratios of objects of interest. They
are chosen based on the average object size of the objects in a given dataset. Anchor boxes allow
various types of objects to be detected without having the bounding box coordinates
alternated during localization.
Feature Maps: A feature map is the output image of a layer when a Convolutional Neural Network
(CNN) is used to capture the result of applying filters to that input image. Feature maps allow a
deeper understanding of the features being detected by a CNN. Single-shot detectors have to
consider the issue of multiple scales because they detect objects with just one pass through a CNN
framework. This results in a decrease in detections for small objects: small objects can lose signal
during downsampling in the pooling layers, particularly when the CNN was trained on only a small
subset of such smaller images, so even when the number of objects is the same, the CNN may fail
to detect the small objects and count them towards the detections.
To prevent this, multiple feature maps can be used to allow single-shot detectors to look for objects
within CNN layers – including earlier layers with higher resolution images. Single-shot detectors
are still not an ideal option for small object tracking because of the difficulty they experience when
detecting small objects. Tight groupings can prove especially difficult. For instance, overhead
drone shots of a group of herd animals will be difficult to track using single-shot detectors.
Image and Feature Pyramid Representations: Feature pyramids, also known as multi-level feature
maps because of their pyramidal structure, are a preliminary solution for object scale variation
when using object tracking datasets. Hence, feature pyramids model the most useful information
regarding objects of different sizes in a top-down representation and therefore make it easier to
detect objects of varying sizes. Strategies such as image pyramids and feature pyramids are useful
for preventing scaling issues. The feature pyramid is based on multi-scale feature maps, which
uses less computational energy than image pyramids. This is because image pyramids consist of a
set of resized versions of one input image that are then sent to the detector at testing.
4. Occlusion
Occlusion has several definitions. In medicine, occlusion is the blockage of a blood vessel as the
vessel closes up; in deep learning, the term has a similar meaning. In AI vision tasks using deep
learning, occlusion happens when multiple objects come too close together (merge). This causes
issues for object tracking systems because the occluded objects are seen as a single object, or
because the system simply tracks the object incorrectly: it can get confused and identify the
initially tracked object as a new object. Occlusion sensitivity prevents this misidentification by
allowing the user to
understand which parts of an image are the most important for the object tracking system to
classify. Occlusion sensitivity refers to a measure of the network’s sensitivity to occlusion in
different data regions. It is done using small subsets of the original dataset.
Object Tracking consists of multiple subtypes because it is such a broad application. Levels of
object tracking differ depending on the number of objects being tracked.
Multiple object tracking is defined as the problem of automatically identifying multiple objects in
a video and representing them as a set of trajectories with high accuracy. Hence, multi-object
tracking aims to track more than one object in digital images. It is also called multi-target tracking,
as it attempts to analyze videos to identify objects (“targets”) that belong to more than one
predetermined class. Multiple object tracking is of great importance in autonomous driving, where
it is used to detect and predict the behavior of pedestrians or other vehicles. Hence, the algorithms
are often benchmarked on the KITTI tracking test. KITTI is a challenging real-world computer
vision benchmark and image dataset, popularly used in autonomous driving. In 2021, the best
performing multiple object tracking algorithms were DEFT (88.95 MOTA, Multiple Object
Tracking Accuracy), CenterTrack (89.44 MOTA), and SRK ODESA (90.03 MOTA).
Multiple Object Tracking (MOT) vs. General Object Detection
Object detections typically produce a collection of bounding boxes as outputs. Multiple object
tracking often has little to no prior training regarding the appearance and number of targets.
Bounding boxes are identified using their height, width, coordinates, and other parameters.
Meanwhile, MOT algorithms additionally assign a target ID to each bounding box. This target ID
is known as a detection, and it is important because it allows the model to distinguish among
objects within a class. For example, instead of identifying all cars in a photo where multiple cars
are present as just “car,” MOT algorithms attempt to identify different cars as being different from
each other rather than all falling under the “car” label. For a visual representation of this metaphor,
refer to the image below.
Single Object Tracking
Single Object Tracking (SOT) creates bounding boxes that are given to the tracker based on the
first frame of the input image. Single Object Tracking is also sometimes known as Visual Object
Tracking. SOT implies that one singular object is tracked, even in environments involving other
objects. Single Object Trackers are meant to focus on one given object rather than multiple. The
object of interest is determined in the first frame, which is where the object to be tracked is
initialized for the first time. The tracker is then tasked with locating that unique target in all other
given frames. SOT falls under the detection-free tracking category, which means that it requires
manual initialization of a fixed number of objects in the first frame. These objects are then
localized in consequent frames. A drawback of detection-free tracking is that it cannot deal with
scenarios where new objects appear in the middle frames. SOT models should be able to track any
given object.
Batch method: Batch tracking algorithms use information from future video frames when deducing
the identity of an object in a certain frame. Batch tracking algorithms use non-local information
regarding the object. This methodology results in a better quality of tracking.
Online method: While batch tracking algorithms access future frames, online tracking algorithms
only use present and past information to come to conclusions regarding a certain frame.
Online tracking methods for performing MOT generally perform worse than batch
methods because of the limitation of online methods staying constrained to the present frame.
However, this methodology is sometimes necessary because of the use case. For example, real-
time problems requiring the tracking of objects, like navigation or autonomous driving, do not
have access to future video frames, which is why online tracking methods are still a viable option.
Most multiple object tracking algorithms contain a basic set of steps that remain constant as
algorithms vary. Most of the so-called multi-target tracking algorithms share the following stages:
Stage #1: Designation or Detection: Targets of interest are noted and highlighted in the designation
phase. The algorithm analyzes input frames to identify objects that belong to target classes.
Bounding boxes are used to perform detections as part of the algorithm.
Stage #2: Motion: Feature extraction algorithms analyze detections to extract appearance and
interaction features. A motion predictor, in most cases, is used to predict subsequent positions of
each tracked target.
Stage #3: Recall: Feature predictions are used to calculate similarity scores between detection
couplets. Those scores are then used to associate detections that belong to the same target. IDs are
assigned to similar detections, and different IDs are applied to detections that are not part of pairs.
Some object tracking models are created using these steps separately from each other, while others
combine and use the steps in conjunction. These differences in algorithm processing create unique
models where some are more accurate than others.
Convolutional Neural Networks (CNN) remain the most used and reliable network for object
tracking. However, multiple architectures and algorithms are being explored as well. Among these
algorithms are Recurrent Neural Networks (RNNs), Autoencoders (AEs), Generative Adversarial
Networks (GANs), Siamese Neural Networks (SNNs), and custom neural networks.
Although object detectors can be used to track objects if applied frame-by-frame, this is a
computationally limiting and therefore rather inefficient method of performing object tracking.
Instead, object detection should be applied once, and then the object tracker can handle every frame
after the first. This is a more computationally effective and less cumbersome process of performing
object tracking.
1. OpenCV
OpenCV object tracking is a popular method because OpenCV has many built-in algorithms
that are specifically optimized for the needs and objectives of object or motion tracking.
Specific OpenCV object trackers include the BOOSTING, MIL, KCF, CSRT, MedianFlow,
TLD, MOSSE, and GOTURN trackers. Each of these trackers is best for different goals. For
example, CSRT is best when the user requires a higher object tracking accuracy and can tolerate
slower FPS throughput. The selection of an OpenCV object tracking algorithm depends on the
advantages and disadvantages of each specific tracker:
The KCF tracker is not as accurate compared to the CSRT but provides comparably higher FPS.
The MOSSE tracker is very fast, but its accuracy is even lower than tracking with KCF. Still, if
you are looking for the fastest object tracking OpenCV method, MOSSE is a good choice.
The GOTURN tracker is the only detector for deep learning-based object tracking with OpenCV.
The original implementation of GOTURN is in Caffe, but it has been ported to the OpenCV
Tracking API.
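A minimal sketch of frame-by-frame tracking with OpenCV's tracking API. It assumes opencv-contrib-python is installed and that a local file input.mp4 exists (both are assumptions); note that in some OpenCV 4.x builds the tracker factories live under cv2.legacy, e.g. cv2.legacy.TrackerCSRT_create().

import cv2

cap = cv2.VideoCapture("input.mp4")          # hypothetical video file
ok, frame = cap.read()

# Select an initial bounding box for the target, then create a tracker.
bbox = cv2.selectROI("frame", frame, False)
tracker = cv2.TrackerCSRT_create()           # accurate but slower; KCF/MOSSE trade accuracy for FPS
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, bbox = tracker.update(frame)      # detection ran once; the tracker handles later frames
    if found:
        x, y, w, h = map(int, bbox)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) & 0xFF == 27:          # press Esc to quit
        break

cap.release()
cv2.destroyAllWindows()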
2. DeepSORT
DeepSORT is a good object tracking algorithm choice, and it is one of the most widely used object
tracking frameworks. Appearance information is integrated within the algorithm, which vastly
improves DeepSORT performance. Because of the integration, objects are trackable through
longer periods of occlusion – reducing the number of identity switches.
For complete information on the inner workings of DeepSORT and specific algorithmic
differences between DeepSORT and other algorithms, we suggest the article “Object Tracking
using DeepSORT in TensorFlow 2” by Anushka Dhiman.
4. MDNet
MDNet is a fast and accurate, CNN-based visual tracking algorithm inspired by the R-CNN object
detection network. It functions by sampling candidate regions and passing them through a CNN.
The CNN is typically pre-trained on a vast dataset and refined at the first frame in an input video.
Therefore, MDNet is attractive for real-time object tracking use cases, although it suffers from
high computational complexity in terms of speed and space while remaining an accurate option.
The computation-heavy aspects of MDNet can, however, be minimized by performing RoI
(Region of Interest) Pooling, which is a relatively effective way of avoiding repetitive observations
and accelerating inference.