Machine Learning - 1
Machine Learning
“Learning denotes changes in the system that are adaptive in the sense that they enable the system
to do the same task (or tasks drawn from the same population) more effectively the next time.” --
Herbert Simon
Why would we want an agent to learn? If the design of the agent can be improved, why
wouldn’t the designers just program in that improvement to begin with?
1. First, the designers cannot anticipate all possible situations that the agent might find itself
in. For example, a robot designed to navigate mazes must learn the layout of each new
maze it encounters.
2. Second, the designers cannot anticipate all changes over time; a program designed to
predict tomorrow’s stock market prices must learn to adapt when conditions change from
boom to bust.
3. Third, sometimes human programmers have no idea how to program a solution themselves.
For example, most people are good at recognizing the faces of family members, but even
the best programmers are unable to program a computer to accomplish that task, except by
using learning algorithms.
Collected by Bipin Timalsina
Machine Learning
The term Machine Learning was coined in 1959 by Arthur Samuel, an American pioneer
in the field of computer gaming and artificial intelligence. According to him, machine
learning is the “field of study that gives computers the ability to learn without being
explicitly programmed”.
In 1997, Tom Mitchell gave a “well-posed” mathematical and relational definition.
According to him, “A computer program is said to learn from experience E with respect
to some task T and some performance measure P, if its performance on T, as measured
by P, improves with experience E”.
The difference between traditional programming and machine learning can be illustrated
with following figures (source: www.datasciencecentral.com)
Machine learning (ML) is a category of algorithms that allows software applications to become
more accurate in predicting outcomes without being explicitly programmed. The basic premise of
machine learning is to build algorithms that can receive input data and use statistical analysis to
predict an output while updating outputs as new data becomes available.
Machine learning usually refers to the changes in systems that perform tasks associated with
artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control,
prediction, etc. The changes might be either enhancements to already performing systems or
synthesis of new systems
Any component of an agent can be improved by learning from data. The improvements, and the
techniques used to make them, depend on four major factors:
Supervised Learning
In supervised learning the agent observes some example input–output pairs and learns a
function that maps from input to output.
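The task of supervised learning is this: given a training set of N example input–output pairs
(x1, y1), (x2, y2), ..., (xN, yN), where each y was generated by an unknown function y = f(x),
discover a function h that approximates the true function f.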
Here x and y can be any value; they need not be numbers. The function h is a hypothesis.
Learning is a search through the space of possible hypotheses for one that will perform
well, even on new examples beyond the training set.
To measure the accuracy of a hypothesis we give it a test set of examples that are distinct
from the training set. We say a hypothesis generalizes well if it correctly predicts the value
of y for novel examples. Sometimes the function f is stochastic—it is not strictly a function
of x, and what we have to learn is a conditional probability distribution, P(Y | x).
When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the
learning problem is called classification, and is called Boolean or binary classification if
there are only two values.
When y is a number (such as tomorrow’s temperature), the learning problem is called
regression. (Technically, solving a regression problem is finding a conditional expectation
or average value of y, because the probability that we have found exactly the right real-
valued number for y is 0.)
In supervised learning, an AI system is presented with labeled data, which means that each
example is tagged with the correct label.
The goal is to approximate the mapping function so well that, given new input data (x), you can
predict the output variable (y) for that data.
It is called supervised learning because the process of an algorithm learning from the training
dataset can be thought of as a teacher supervising the learning process. We know the correct
answers, the algorithm iteratively makes predictions on the training data and is corrected by the
teacher. Learning stops when the algorithm achieves an acceptable level of performance.
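The train-then-test procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy data, the choice of a 1-nearest-neighbour rule as the hypothesis h, and the names `predict` and `accuracy` are all introduced here and do not come from the notes.

```python
# Toy labeled examples (x, y): x is a pair of numbers, y is a class label.
# The data are made up for illustration.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]
test = [((1.2, 1.4), "A"), ((5.5, 5.0), "B")]


def predict(x, examples):
    """Hypothesis h: return the label of the training example closest to x."""
    def squared_distance(example):
        ex, _label = example
        return sum((a - b) ** 2 for a, b in zip(x, ex))
    return min(examples, key=squared_distance)[1]


def accuracy(examples, held_out):
    """Fraction of held-out examples whose label h predicts correctly."""
    correct = sum(predict(x, examples) == y for x, y in held_out)
    return correct / len(held_out)


test_accuracy = accuracy(train, test)
print(test_accuracy)
```

The test set is kept separate from the training set, so the reported accuracy says something about how well the hypothesis generalizes to novel examples.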
As shown in the above example, we initially take some data and mark each item as ‘Spam’ or
‘Not Spam’. This labeled data is used to train the supervised model.
Once the model is trained, we can test it on new mails and check whether it predicts the
right output.
Supervised learning is the most mature, the most studied and the type of learning used by most
machine learning algorithms. Learning with supervision is much easier than learning without
supervision
Unsupervised Learning
In unsupervised learning the agent learns patterns in the input even though no explicit
feedback is supplied.
The most common unsupervised learning task is clustering: detecting potentially useful
clusters of input examples. For example, a taxi agent might gradually develop a concept of
“good traffic days” and “bad traffic days” without ever being given labeled examples of
each by a teacher.
In unsupervised learning, an AI system is presented with unlabeled, uncategorized data and the
system’s algorithms act on the data without prior training. The output is dependent upon the
coded algorithms. Subjecting a system to unsupervised learning is one way of testing AI.
In the above example, we have given our model some characters, which are ‘Ducks’ and ‘Not
Ducks’. In the training data, we do not provide any labels. The unsupervised model is able to
separate the two kinds of characters by looking at the data itself: it models the underlying
structure or distribution in the data in order to learn more about it.
Input data is not labeled and does not have a known result.
A model is prepared by deducing structures present in the input data. This may be to extract
general rules. It may be through a mathematical process to systematically reduce
redundancy, or it may be to organize data by similarity.
It is called unsupervised learning because, unlike supervised learning above, there are no
correct answers and there is no teacher. Algorithms are left to their own devices to discover
and present the interesting structure in the data.
Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
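As a rough illustration of clustering, the sketch below groups customers by a single number using a one-dimensional k-means loop. The spend figures, the starting centers and the function name `k_means_1d` are assumptions made for this example; k-means is one common clustering technique, not the only one.

```python
def k_means_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical monthly spend of six customers; no labels are provided.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = k_means_1d(spend, centers=[0, 50])
print(centers, clusters)
```

No labels are supplied at any point; the two groups emerge only from the similarity of the values.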
Reinforcement learning
In reinforcement learning the agent learns from a series of reinforcements—rewards or
punishments. For example, the lack of a tip at the end of the journey gives the taxi agent
an indication that it did something wrong. The two points for a win at the end of a chess
game tells the agent it did something right. It is up to the agent to decide which of the
actions prior to the reinforcement were most responsible for it.
An agent can learn from success and failure, from reward and punishment.
A reinforcement learning algorithm, or agent, learns by interacting with its environment. The agent
receives rewards for performing correctly and penalties for performing incorrectly. The agent
learns without intervention from a human by maximizing its reward and minimizing its penalty.
Reinforcement learning is closely related to dynamic programming and trains algorithms using a
system of reward and punishment.
In the above example, the agent is given two options: a path with water or a path with fire.
A reinforcement algorithm works on a reward system: if the agent takes the fire path, reward
points are subtracted and the agent learns that it should avoid the fire path. If it had chosen
the water path (the safe path), some points would have been added to its reward, and the agent
would learn which path is safe and which is not.
By leveraging the rewards obtained, the agent improves its knowledge of the environment in order
to select the next action.
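The water/fire example can be sketched as a tiny reward-driven learner. The reward values (+1 and -1), the epsilon-greedy exploration rule and the name `learn_paths` are assumptions made for illustration; the notes do not prescribe a particular algorithm.

```python
import random

# Rewards are assumptions matching the example: the water path is safe (+1),
# the fire path is penalised (-1).
REWARD = {"water": 1.0, "fire": -1.0}


def learn_paths(episodes=50, epsilon=0.2, seed=0):
    """Estimate the value of each path from experienced rewards only."""
    rng = random.Random(seed)
    value = {"water": 0.0, "fire": 0.0}   # current reward estimates
    count = {"water": 0, "fire": 0}       # how often each path was tried
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(list(value))       # explore
        else:
            action = max(value, key=value.get)     # exploit best estimate
        reward = REWARD[action]
        count[action] += 1
        # Incremental average of the rewards seen for this action.
        value[action] += (reward - value[action]) / count[action]
    return value


value = learn_paths()
best_path = max(value, key=value.get)
print(value, best_path)
```

The agent is never told which path is safe; it only sees the rewards that follow its own choices.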
(Examples and some descriptions for supervised, unsupervised and reinforcement learning are
taken from this source: https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08)
Semi-Supervised Learning
In semi-supervised learning we are given a few labeled examples and must make what
we can of a large collection of unlabeled examples.
Problems where you have a large amount of input data (X) and only some of the data is
labeled (Y) are called semi-supervised learning problems.
These problems sit in between both supervised and unsupervised learning.
A good example is a photo archive where only some of the images are labeled, (e.g. dog,
cat, person) and the majority are unlabeled.
Many real world machine learning problems fall into this area. This is because it can be
expensive or time-consuming to label data as it may require access to domain experts.
Whereas unlabeled data is cheap and easy to collect and store.
You can use unsupervised learning techniques to discover and learn the structure in the
input variables.
You can also use supervised learning techniques to make best guess predictions for the
unlabeled data, feed that data back into the supervised learning algorithm as training data
and use the model to make predictions on new unseen data.
Bayesian classification is based on Bayes’ theorem, named after Thomas Bayes (1702-1761).
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. For example, a fruit may be considered to
be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend
on each other or upon the existence of the other features, all of these properties
independently contribute to the probability that this fruit is an apple and that is why it is
known as ‘Naive’.
Bayes Theorem
Let X be a data sample whose class label is unknown.
Let H be some hypothesis, such as that the data sample X belongs to a specific class C.
We want to determine the probability that the hypothesis H holds given the observed data
sample X (i.e. P(H|X)).
P(H|X) is the posterior probability representing our confidence in the hypothesis after X
is given. In contrast, P(H) is the prior probability of H for any sample, regardless of how
the data in the sample look.
The posterior probability P(H|X) is based on more information than the prior probability
P(H).
The Bayesian theorem provides a way of calculating the posterior probability P(H|X) using
probabilities P(H), P(X), and P(X|H). The basic relation is:
$$P(H \mid X) = \frac{P(X \mid H) \times P(H)}{P(X)}$$
Or, the probability that an event H occurs given that another event X has already occurred
is equal to the probability that the event X occurs given H has already occurred multiplied
by probability that event H occurs divided by probability of occurrence of X.
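The relation can be written directly as a small function. The function name `posterior` and the numbers passed to it are hypothetical, chosen only to exercise the formula.

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    if p_x <= 0:
        raise ValueError("P(X) must be positive")
    return p_x_given_h * p_h / p_x


# Hypothetical numbers, chosen only to exercise the formula.
p_h_given_x = posterior(p_x_given_h=0.5, p_h=0.2, p_x=0.25)
print(p_h_given_x)
```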
For example, suppose our world of data tuples is confined to customers described by the attributes
age and income, and that X is a 35-year-old customer with an income of $40,000. Suppose that H
is the hypothesis that our customer will buy a computer. Then,
P(H|X) → the probability that customer X will buy a computer given that we know the
customer’s age and income. It is the posterior probability, or a posteriori probability, of H
conditioned on X
P(X|H) → the probability that a customer, X, is 35 years old and earns $40,000, given that
we know the customer will buy a computer. It is the posterior probability of X conditioned
on H.
P(H) → the probability that any given customer will buy a computer, regardless of age,
income, or any other information. It is the prior probability, or a priori probability, of H.
P(X) → the probability that a person from our set of customers is 35 years old and earns
$40,000. It is the prior probability of X.
Naïve Bayesian Classification
Let D be a training set of tuples and their associated class labels.
Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior
probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs
to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤m, j ≠ i
Where,
$$P(C_i \mid X) = \frac{P(X \mid C_i) \times P(C_i)}{P(X)}$$
Here, P(X) is constant for all classes. So, only P(X|Ci) × P(Ci) needs to be maximized.
For data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce
computation in evaluating P(X|Ci), the naïve assumption of class conditional independence is
made. This presumes that the values of the attributes are conditionally independent of one another,
given the class label of the tuple (i.e., that there are no dependence relationships among the
attributes). Thus,
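$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)$$
where x_k is the value of the k-th attribute of tuple X.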
For example,
ID  Age          Income  Student  Credit_Rating  Buy_Computer?
1   Youth        High    No       Fair           No
2   Youth        High    No       Excellent      No
3   Middle_aged  High    No       Fair           Yes
4   Senior       Medium  No       Fair           Yes
5   Senior       Low     Yes      Fair           Yes
6   Senior       Low     Yes      Excellent      No
7   Middle_aged  Low     Yes      Excellent      Yes
8   Youth        Medium  No       Fair           No
9   Youth        Low     Yes      Fair           Yes
10  Senior       Medium  Yes      Fair           Yes
11  Youth        Medium  Yes      Excellent      Yes
12  Middle_aged  Medium  No       Excellent      Yes
13  Middle_aged  High    Yes      Fair           Yes
14  Senior       Medium  No       Excellent      No
Table: Buys_Computer data
Test data: X:(Age=Youth, Income=Medium, Student=Yes, Credit_Rating=Fair)
Let C1: Buys_Computer=Yes
C2: Buys_Computer=No
So,
P(C1)=P(Buys_Computer=Yes) = 9/14 = 0.643
P(C2)=P(Buys_Computer=No) = 5/14 = 0.357
To compute P(X|Ci) for i = 1, 2, we first compute the following conditional probabilities:
P(Age = Youth | Buys_Computer = Yes) = 2/9 = 0.222
P(Age = Youth | Buys_Computer = No) = 3/5 = 0.600
P(Income = Medium | Buys_Computer = Yes) = 4/9 = 0.444
P(Income = Medium | Buys_Computer = No) = 2/5 = 0.400
P(Student = Yes | Buys_Computer = Yes) = 6/9 = 0.667
P(Student = Yes | Buys_Computer = No) = 1/5 = 0.200
P(Credit_Rating = Fair | Buys_Computer = Yes) = 6/9 = 0.667
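The listing above stops before P(Credit_Rating = Fair | Buys_Computer = No); counting rows in the table gives 2/5 = 0.400. The remaining steps, multiplying the conditional probabilities for each class and then multiplying by the class prior, can be checked with a short sketch. The function name `naive_bayes_scores` is introduced here for illustration, and the code uses plain counts with no smoothing, so an attribute value never seen with a class would force that class's score to zero.

```python
from collections import Counter

# Rows of the Buys_Computer table: (Age, Income, Student, Credit_Rating, class).
DATA = [
    ("Youth", "High", "No", "Fair", "No"),
    ("Youth", "High", "No", "Excellent", "No"),
    ("Middle_aged", "High", "No", "Fair", "Yes"),
    ("Senior", "Medium", "No", "Fair", "Yes"),
    ("Senior", "Low", "Yes", "Fair", "Yes"),
    ("Senior", "Low", "Yes", "Excellent", "No"),
    ("Middle_aged", "Low", "Yes", "Excellent", "Yes"),
    ("Youth", "Medium", "No", "Fair", "No"),
    ("Youth", "Low", "Yes", "Fair", "Yes"),
    ("Senior", "Medium", "Yes", "Fair", "Yes"),
    ("Youth", "Medium", "Yes", "Excellent", "Yes"),
    ("Middle_aged", "Medium", "No", "Excellent", "Yes"),
    ("Middle_aged", "High", "Yes", "Fair", "Yes"),
    ("Senior", "Medium", "No", "Excellent", "No"),
]


def naive_bayes_scores(data, x):
    """Return P(X|Ci) * P(Ci) for every class Ci, assuming the attributes
    are conditionally independent given the class."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(data)              # prior P(Ci)
        for k, value in enumerate(x):
            matches = sum(1 for row in data if row[-1] == label and row[k] == value)
            score *= matches / n_label           # P(x_k | Ci)
        scores[label] = score
    return scores


X = ("Youth", "Medium", "Yes", "Fair")
scores = naive_bayes_scores(DATA, X)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

For this tuple the score for Buys_Computer = Yes is roughly 0.028 and the score for No is roughly 0.007, so the classifier predicts Yes.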
Another Example:
Following is a training data set of weather and corresponding target variable ‘Play’ (suggesting
possibilities of playing). Now, we need to classify whether players will play or not based on
weather condition. Let’s follow the below steps to perform it.
Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and
probability of playing is 0.64.
Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of prediction.
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 ≈ 0.60 (exactly 3/5 with unrounded fractions). This is
higher than P(No | Sunny) = 0.40, so the predicted class is Yes.
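The arithmetic in Step 3 can be checked with exact fractions, which avoids the small rounding error introduced by the two-decimal values. The variable names are chosen here for illustration.

```python
from fractions import Fraction

# Probabilities quoted in Step 3, kept as exact fractions.
p_sunny_given_yes = Fraction(3, 9)
p_yes = Fraction(9, 14)
p_sunny = Fraction(5, 14)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny   # only two classes: Yes and No

print(p_yes_given_sunny, float(p_yes_given_sunny))
```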
Optimization refers to finding the values of inputs in such a way that we get the “best”
output values. The definition of “best” varies from problem to problem, but in
mathematical terms, it refers to maximizing or minimizing one or more objective
functions, by varying the input parameters.
The set of all possible solutions or values which the inputs can take makes up the search
space. In this search space lies a point or a set of points which gives the optimal solution.
The aim of optimization is to find that point or set of points in the search space.
In a genetic algorithm, a population of candidate solutions undergoes recombination and
mutation, producing new children, and the process is repeated over various generations. Each
individual (or candidate solution) is assigned a fitness value (based on its objective function
value) and the fitter individuals are given a higher chance to mate and yield fitter individuals.
This is in line with the Darwinian theory of “Survival of the Fittest”.
In this way we keep “evolving” better individuals or solutions over generations, till we
reach a stopping criterion.
Darwin’s theory of evolution explains natural selection as follows:
– Individuals pass on traits to offspring
– Individuals have different traits
– Fittest individuals survive to produce more offspring
– Over time, variation can accumulate leading to new species
A genetic or evolutionary algorithm applies the principles of evolution found in nature to
the problem of finding an optimal solution to an optimization problem. In a "genetic algorithm,"
the problem is encoded in a series of bit strings that are manipulated by the algorithm.
Basic Terminologies in GA
Population − It is a subset of all the possible (encoded) solutions to the given problem.
The population for a GA is analogous to a population of human beings, except that
candidate solutions take the place of human beings.
Chromosomes − A chromosome is one such solution to the given problem. A
chromosome consists of genes, commonly described as blocks of DNA, where each gene
encodes a specific trait, for example hair color or eye color (in genetics).
Gene − A gene is one element position of a chromosome.
In a genetic algorithm, the set of genes of an individual is represented using a string,
in terms of an alphabet. Usually, binary values are used (string of 1s and 0s). We
say that we encode the genes in a chromosome.
Genotype − Genotype is the population in the computation space. In the computation space, the
solutions are represented in a way which can be easily understood and manipulated using a
computing system.
Phenotype − Phenotype is the population in the actual real world solution space in which solutions
are represented in a way they are represented in real world situations.
Decoding and Encoding − For simple problems, the phenotype and genotype spaces are the same.
However, in most of the cases, the phenotype and genotype spaces are different. Decoding is a
process of transforming a solution from the genotype to the phenotype space, while encoding is a
process of transforming from the phenotype to genotype space. The most common method of
encoding is binary encoding, although other encoding techniques also exist.
Fitness function
The fitness function determines how fit an individual is (the ability of an individual to compete
with other individuals). It gives a fitness score to each individual. The probability that an
individual will be selected for reproduction is based on its fitness score.
In some cases, the fitness function and the objective function may be the same, while in others it
might be different based on the problem.
Genetic operators
A genetic operator is an operator used in genetic algorithms to guide the algorithm towards a
solution to a given problem.
1. Selection
2. Crossover
3. Mutation
Selection
The idea of the selection operator is to select the fittest individuals and let them pass their genes to the
next generation.
Two pairs of individuals (parents) are selected based on their fitness scores. Individuals with high
fitness have more chance to be selected for reproduction.
Crossover
Crossover is the most significant phase in a genetic algorithm. For each pair of parents to be
mated, a crossover point is chosen at random from within the genes.
Offspring are created by exchanging the genes of the parents with each other until the crossover
point is reached.
(Figure: new offspring)
Mutation
In certain new offspring formed, some of their genes can be subjected to a mutation with a low
random probability. This implies that some of the bits in the bit string can be flipped.
Mutation occurs to maintain diversity within the population and prevent premature convergence.
The algorithm terminates if the population has converged (does not produce offspring which are
significantly different from the previous generation). Then it is said that the genetic algorithm has
provided a set of solutions to our problem
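Putting selection, crossover and mutation together, a genetic algorithm can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the notes: the toy fitness function (count the 1 bits), fitness-proportionate selection, keeping the best individual unchanged in each generation (elitism), and stopping after a fixed number of generations rather than testing for convergence.

```python
import random


def fitness(chromosome):
    """Assumed toy objective: count the 1 bits in the bit string."""
    return sum(chromosome)


def select(population, rng):
    """Fitness-proportionate selection of one parent."""
    weights = [fitness(c) + 1 for c in population]   # +1 avoids all-zero weights
    return rng.choices(population, weights=weights, k=1)[0]


def crossover(a, b, rng):
    """Single-point crossover at a random position."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rate, rng):
    """Flip each bit with a small probability."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def run_ga(n_bits=12, pop_size=20, generations=40, rate=0.02, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        next_gen = [best]                            # elitism: keep the best as is
        while len(next_gen) < pop_size:
            child1, child2 = crossover(select(population, rng), select(population, rng), rng)
            next_gen.extend([mutate(child1, rate, rng), mutate(child2, rate, rng)])
        population = next_gen[:pop_size]
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = run_ga()
print(best, history[-1])
```

Because the best individual is always carried over, the best fitness recorded in `history` never decreases from one generation to the next.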
A neuron is a cell in the brain whose principal function is the collection, processing, and
dissemination of electrical signals. The brain’s information-processing capacity comes from
networks of such neurons. For this reason, some of the earliest AI work aimed to create such
artificial networks. (Other names are connectionism, parallel distributed processing, and
neural computing.)
An Artificial Neural Network (ANN) is an efficient computing system whose central theme is
borrowed from the analogy of biological neural networks. ANNs are also called “artificial
neural systems,” “parallel distributed processing systems,” or “connectionist systems.” An ANN
comprises a large collection of units that are interconnected in some pattern to allow communication
between the units. These units, also referred to as nodes or neurons, are simple processors which
operate in parallel.
Every neuron is connected to other neurons through connection links. Each connection link is
associated with a weight that carries information about the input signal. This is the most useful
information for neurons to solve a particular problem because the weight usually excites or inhibits
the signal that is being communicated. Each neuron has an internal state, which is called an
activation signal. Output signals, which are produced after combining the input signals and the
activation rule, may be sent to other units.
We can think of an Artificial Neural Network as a computational model that is inspired by the way
biological neural networks in the human brain process information.
Neural networks process information in a way similar to the human brain. The network is
composed of a large number of highly interconnected processing elements (neurons) working in
parallel to solve a specific problem. Neural networks learn by example. They cannot be
programmed to perform a specific task. The examples must be selected carefully; otherwise useful
time is wasted or, even worse, the network might function incorrectly. The disadvantage is
that because the network finds out how to solve the problem by itself, its operation can be
unpredictable.
On the other hand, conventional computers use a cognitive approach to problem solving; the way
the problem is to be solved must be known and stated in small unambiguous instructions. These
instructions are then converted to a high-level language program and then into machine code that
the computer can understand. These machines are totally predictable; if anything goes wrong, it is
due to a software or hardware fault.
Parallel processing
One of the major advantages of the neural network is its ability to do many things at once. With
traditional computers, processing is sequential--one task, then the next, then the next, and so on.
The idea of threading makes it appear to the human user that many things are happening at one
time. For instance, the Netscape throbber is shooting meteors at the same time that the page is
loading. However, this is only an appearance; processes are not actually happening
simultaneously.
Based upon the way they function, traditional computers have to learn by rules, while artificial
neural networks learn by example, by doing something and then learning from it. Because of these
fundamental differences, the applications to which we can tailor them are extremely different. We
will explore some of the applications later in the presentation.
Self-programming
The "connections" or concepts learned by each type of architecture are different as well. Von
Neumann computers are programmed in higher-level languages like C or Java, which are then
translated down to the machine's assembly language. Because of their style of learning,
artificial neural networks can, in essence, "program themselves." While conventional
computers must learn only by doing different sequences or steps in an algorithm, neural networks
are continuously adaptable by truly altering their own programming. It could be said that
conventional computers are limited by their parts, while neural networks can work to become more
than the sum of their parts.
Speed
The speed of each kind of computer depends on different aspects of the processor. Von Neumann
machines require either big processors or the tedious, error-prone idea of parallel processors,
while neural networks require the use of multiple chips custom-built for the application.
A nerve cell (neuron) is a special biological cell that processes information. According to one
estimate, the human brain contains a huge number of neurons, approximately 10^11, with numerous
interconnections, approximately 10^15.
Dendrites − They are tree-like branches, responsible for receiving information from the other
neurons the cell is connected to. In a sense, they are like the ears of the neuron.
Soma − It is the cell body of the neuron and is responsible for processing the information
received from the dendrites.
Axon − It is just like a cable through which the neuron sends information.
Synapses − They are the connections between the axon and the dendrites of other neurons.
Biological neuron    Artificial neuron
Soma                 Node
Dendrites            Input
Axon                 Output
Figure: A simple mathematical model for a neuron. The unit’s output activation is
$a_j = g\left(\sum_{i=0}^{n} w_{i,j}\, a_i\right)$, where $a_i$ is the output activation of unit i and $w_{i,j}$ is the weight on the link
from unit i to this unit.
Neural networks are composed of nodes or units connected by directed links. A link from unit i to
unit j serves to propagate the activation $a_i$ from i to j. Each link also has a numeric weight $w_{i,j}$
associated with it, which determines the strength and sign of the connection. Just as in linear
regression models, each unit has a dummy input $a_0 = 1$ with an associated weight $w_{0,j}$. Each unit
j first computes a weighted sum of its inputs:
$$in_j = \sum_{i=0}^{n} w_{i,j}\, a_i$$
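The weighted sum and the resulting output activation of a single unit can be sketched as below. The weights and inputs are hypothetical, and the sigmoid is used as the activation function g only as an example; the helper names are introduced here.

```python
import math


def weighted_sum(weights, activations):
    """in_j = sum_i w_ij * a_i, with index 0 holding the dummy input a_0 = 1."""
    return sum(w * a for w, a in zip(weights, activations))


def unit_output(weights, inputs, g):
    """a_j = g(in_j). The dummy input a_0 = 1 is prepended here."""
    activations = [1.0] + list(inputs)
    return g(weighted_sum(weights, activations))


def sigmoid(x):
    """Example activation function g."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical weights: w_0j = -1.0 (bias weight), w_1j = 2.0, w_2j = 0.5.
weights = [-1.0, 2.0, 0.5]
in_j = weighted_sum(weights, [1.0, 0.5, 2.0])
a_j = unit_output(weights, [0.5, 2.0], sigmoid)
print(in_j, a_j)
```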
Activation Function
An activation function decides whether a neuron should be activated or not, based on the
weighted sum of its inputs (with the bias added).
An activation function is a function used to transform the activation level of a unit (neuron)
into an output signal.
The activation function of a node defines the output of that node given an input or set of inputs.
It is also called a transfer function or squashing function.
Activation functions perform a transformation on the input received, in order to keep values
within a manageable range.
Activation functions fall under one of the following three categories:
o Binary step functions
o Linear functions
o Non-linear functions
Mathematically,
$$f(x) = \begin{cases} 0 & \text{if } x < k \\ 1 & \text{if } x \geq k \end{cases}$$
Modern neural network models use non-linear activation functions. They allow the model
to create complex mappings between the network’s inputs and outputs, which are essential
for learning and modeling complex data, such as images, video, audio, and data sets which
are non-linear or have high dimensionality.
Almost any process imaginable can be represented as a functional computation in a neural
network, provided that the activation function is non-linear.
Non-linear functions address the problems of a linear activation function:
o They allow backpropagation because they have a derivative function which is
related to the inputs.
o They allow “stacking” of multiple layers of neurons to create a deep neural
network. Multiple hidden layers of neurons are needed to learn complex data sets
with high levels of accuracy
Sigmoid, hyperbolic tangent (tanh), and ReLU are some examples.
The range of the tanh function is (-1, 1). tanh is also sigmoidal (S-shaped).
ReLU stands for rectified linear unit, and is a type of activation function. Mathematically, it is
defined as
$$f(x) = \max(0, x)$$
Range: 0 to infinity
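For comparison, the activation functions named in this section can be written side by side. The sigmoid is given in its standard logistic form, 1 / (1 + e^(-x)); the function names and the default threshold k = 0 for the binary step are choices made for this sketch.

```python
import math


def binary_step(x, k=0.0):
    """Binary step with threshold k: 0 below k, 1 at or above it."""
    return 1 if x >= k else 0


def sigmoid(x):
    """Logistic sigmoid, range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent, range (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectified linear unit: max(0, x), range 0 to infinity."""
    return max(0.0, x)


for x in (-2.0, 0.0, 2.0):
    print(x, binary_step(x), round(sigmoid(x), 3), round(tanh(x), 3), relu(x))
```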
Activation function: $$f(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases}$$
b =0.5, w1 = -1
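Read as a single-input unit with bias b = 0.5 and weight w1 = -1, these values give a NOT gate when combined with a step activation that fires at 0. The sketch below checks that reading. The threshold at 0 and the function names are assumptions, since the accompanying figure is not reproduced here.

```python
def step(x):
    """Assumed step activation: 1 when x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def not_unit(x1, b=0.5, w1=-1.0):
    """Single-input unit: output = step(b + w1 * x1)."""
    return step(b + w1 * x1)


truth_table = {x1: not_unit(x1) for x1 in (0, 1)}
print(truth_table)
```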
Network structures
Feed forward network
Feed-forward ANNs allow signals to travel one way only, from input to output. There is
no feedback (loops), i.e. the output of any layer does not affect that same layer.
Feed-forward ANNs tend to be straightforward networks that associate inputs with outputs.
They are extensively used in pattern recognition. This type of organization is also referred
to as bottom-up or top-down.
It is a non-recurrent network having processing units/nodes in layers, and all the nodes in a
layer are connected with the nodes of the previous layer. The connections have different
weights on them. There is no feedback loop, which means the signal can only flow in one
direction, from input to output. It may be divided into the following two types −
Feedback networks can have signals traveling in both directions by introducing loops in
the network. Feedback networks are very powerful and can get extremely complicated.
Feedback networks are dynamic; their 'state' is changing continuously until they reach an
equilibrium point. They remain at the equilibrium point until the input changes and a new
equilibrium needs to be found. Feedback architectures are also referred to as interactive or
recurrent.
As the name suggests, a feedback network has feedback paths, which means the signal can
flow in both directions using loops. This makes it a non-linear dynamic system, which
changes continuously until it reaches a state of equilibrium
Feed-forward example
A neural network that contains an input layer, an output layer and one or more hidden layers is called
a multilayer neural network. The advantage of adding hidden layers is that it enlarges the space of
hypotheses the network can represent. Layers of the network are normally fully connected.
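One way to see how a hidden layer enlarges the hypothesis space is the XOR function, which a single step unit cannot represent but a small two-layer network can. The weights below are hand-picked for illustration, not learned, and the helper names are introduced here.

```python
def step(x):
    """Step activation: 1 when x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def layer(inputs, weights, biases):
    """Fully connected layer of step units: one weight row per unit."""
    return [step(b + sum(w * a for w, a in zip(row, inputs)))
            for row, b in zip(weights, biases)]


# Hand-picked (not learned) weights, assumed for illustration.
HIDDEN_W, HIDDEN_B = [[1, 1], [1, 1]], [-0.5, -1.5]   # units compute OR and AND
OUTPUT_W, OUTPUT_B = [[1, -1]], [-0.5]                # OR and not AND = XOR


def xor_net(x1, x2):
    """Forward pass: input layer -> hidden layer -> output unit."""
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]


outputs = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(outputs)
```

Signals flow strictly from the inputs through the hidden layer to the output, so this is also a small instance of a feed-forward network.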
Once the number of layers, and the number of units in each layer, has been selected, training is used
to set the network's weights and thresholds so as to minimize the prediction error made by the
network.
Training is the process of adjusting the weights and thresholds to produce the desired result for
different sets of data.
The operation of a neural network is determined by the values of the interconnection weights.
There is no algorithm that determines how the weights should be assigned in order to solve specific
problems. Hence, the weights are determined by a learning process.
Supervised Learning
In supervised learning, the network is presented with inputs together with the target
(teacher signal) outputs. Then, the neural network tries to produce an output as close as
possible to the target signal by adjusting the values of internal weights. The most common
supervised learning method is the “error correction method”.
The error correction method is used for networks whose neurons have discrete output
functions. Neural networks are trained with this method in order to reduce the error
(the difference between the network's output and the desired output) to zero.
Unsupervised Learning
In unsupervised learning, there is no teacher (target signal) from outside and the network
adjusts its weights in response to only the input patterns. A typical example of unsupervised
learning is Hebbian learning.
Consider a machine (or living organism) which receives some sequence of inputs x1, x2, x3, . . .,
where xt is the sensory input at time t. In supervised learning the machine is also given a
sequence of desired outputs y1, y2, . . . , and the goal of the machine is to learn to produce
the correct output given a new input. In unsupervised learning, by contrast, the machine simply receives
inputs x1, x2, . . ., but obtains neither supervised target outputs nor rewards from its environment.
It may seem somewhat mysterious to imagine what the machine could possibly learn given that it
doesn’t get any feedback from its environment. However, it is possible to develop a formal
framework for unsupervised learning based on the notion that the machine’s goal is to build
representations of the input that can be used for decision making, predicting future inputs, efficiently
communicating the inputs to another machine, etc. In a sense, unsupervised learning can be thought of
as finding patterns in the data above and beyond what would be considered pure unstructured noise.
Hebbian Learning
The oldest and most famous of all learning rules is Hebb’s postulate of learning:
“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently
takes part in firing it, some growth process or metabolic change takes place in one or both
cells such that A’s efficiency, as one of the cells firing B, is increased.”
From the point of view of artificial neurons and artificial neural networks, Hebb's principle can be
described as a method of determining how to alter the weights between model neurons. The weight
between two neurons increases if the two neurons activate simultaneously and decreases if
they activate separately. Nodes that tend to be either both positive or both negative at the same
time have strong positive weights, while those that tend to be opposite have strong negative
weights.
Hebbian learning is one of the oldest learning algorithms, and is based in large part on the dynamics
of biological systems. A synapse between two neurons is strengthened when the neurons on either
side of the synapse (input and output) have highly correlated outputs.
“When neuron A repeatedly and persistently takes part in exciting neuron B, the
synaptic connection from A to B will be strengthened.”
From the above postulate, we can conclude that the connections between two neurons might be
strengthened if the neurons fire at the same time and might weaken if they fire at different times.
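Mathematically, the Hebbian weight update can be written as
$$\Delta w_{ji}(t) = \alpha \, x_i(t) \, y_j(t)$$
Here,
$\Delta w_{ji}(t)$ = increment by which the weight of the connection increases at time step $t$
$\alpha$ = the positive, constant learning rate
$x_i(t)$ = the input value from the pre-synaptic neuron at time step $t$
$y_j(t)$ = the output of the post-synaptic neuron at the same time step $t$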
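The quantities defined above combine into the update "weight increment = learning rate × pre-synaptic input × post-synaptic output". A minimal Python sketch of one such step is given below. The function name `hebbian_update`, the bipolar (+1/-1) example values and the learning rate of 0.1 are assumptions made for illustration only.

```python
def hebbian_update(weights, x, y, alpha=0.1):
    """Return new weights after one Hebbian step: w_ji += alpha * x_i * y_j.

    weights[j][i] connects pre-synaptic input i to post-synaptic unit j.
    """
    return [
        [w_ji + alpha * x_i * y_j for w_ji, x_i in zip(row, x)]
        for row, y_j in zip(weights, y)
    ]

# One output unit, two inputs, all weights starting at zero (assumed values).
w = [[0.0, 0.0]]
# Input 0 agrees in sign with the output, input 1 is opposite to it.
w = hebbian_update(w, x=[1.0, -1.0], y=[1.0])
print(w)  # the first weight grows, the second becomes negative
```

The weight to the input that fires together with the output is strengthened, while the weight to the input with the opposite sign is weakened.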
There are two popular weight update rules in perceptron learning: the perceptron rule and the
delta rule. Under the perceptron rule, each weight is updated as:
wi = wi + Δwi
where
Δwi = η (t - o) xi
t: target output
o: actual output of the perceptron
η: learning rate (a small positive constant)
xi: ith input
Algorithm:
i. Initialize the weights and threshold. Let wi(t), (0 <= i <= n), be the weight i at time t, and ø be
the threshold value in the output node. Set w0 to be -ø, the bias, and x0 to be always 1.
Set wi(0) to small random values.
ii. Present input x0, x1, x2, ..., xn and desired output d(t).
iii. Calculate the actual output: y(t) = f[w0(t)x0 + w1(t)x1 + ... + wn(t)xn], where f is the
threshold (step) activation function.
iv. Adapt the weights: wi(t+1) = wi(t) + η [d(t) - y(t)] xi, for 0 <= i <= n.
Steps iii. and iv. are repeated until the iteration error is less than a user-specified
error threshold or a predetermined number of iterations have been completed.
Note that the weights change only when an error is made, so learning occurs only on inputs
that the perceptron currently gets wrong.
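The perceptron learning procedure can be sketched in Python as follows. This is a hedged illustration rather than the notes' own code: the function names, the logical AND training set and the learning rate of 1.0 are assumptions, and the weights start at zero instead of small random values so that the run is reproducible.

```python
def step(net):
    """Threshold activation: fire (1) when the weighted sum is non-negative."""
    return 1 if net >= 0 else 0

def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Train a single perceptron; w[0] is the bias weight paired with x0 = 1."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)  # assumption: zeros instead of small random values
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            xs = [1.0] + list(x)                      # prepend x0 = 1
            y = step(sum(wi * xi for wi, xi in zip(w, xs)))
            if y != d:                                # weights change only on error
                errors += 1
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, xs)]
        if errors == 0:                               # every sample classified correctly
            break
    return w

def predict(w, x):
    return step(sum(wi * xi for wi, xi in zip(w, [1.0] + list(x))))

# Logical AND is linearly separable, so the perceptron rule converges on it.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(AND_SAMPLES)
print(weights, [predict(weights, x) for x, _ in AND_SAMPLES])
```

Here the stopping test is "no misclassified sample in a full pass, or the maximum number of passes reached", which is one concrete form of the error-threshold and iteration-limit criteria mentioned above.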
Delta Rule
What happens if the examples are not linearly separable? To address this situation we try to
approximate the real concept using the delta rule. The key idea is to use a gradient descent
search. The delta rule is a gradient descent learning rule for updating the weights of the inputs
to artificial neurons in a single-layer neural network. It is a special case of the more
general backpropagation algorithm.
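For a neuron $j$ with activation function $g(x)$, the delta rule for $j$'s $i$th weight $w_{ji}$ is given by
$$\Delta w_{ji} = \alpha \, (t_j - y_j) \, g'(h_j) \, x_i$$
where
$\alpha$ is the learning rate
$t_j$ is the target output
$g'$ is the derivative of $g$
$h_j$ is the weighted sum of the neuron's inputs
$y_j$ is the actual output
$x_i$ is the $i$th input
The delta rule is commonly stated in simplified form for a neuron with a linear activation function
as
$$\Delta w_{ji} = \alpha \, (t_j - y_j) \, x_i$$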
The delta rule is derived by attempting to minimize the error in the output of the neural network
through gradient descent. The error for a neural network with $j$ outputs can be measured as
$$E = \sum_j \frac{1}{2} (t_j - y_j)^2$$
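A minimal Python sketch of the delta rule for a single linear unit is shown below. It assumes the simplified update (learning rate times the difference between target and output, times the input) and a squared-error measure with a factor of one half. The training data, the learning rate of 0.05 and the function names are hypothetical, chosen only so that the example has a known answer.

```python
def total_error(w, samples):
    """Squared error E = sum over samples of 0.5 * (t - y)^2 for a linear unit."""
    return sum(
        0.5 * (t - sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))) ** 2
        for x, t in samples
    )

def train_delta(samples, alpha=0.05, epochs=500):
    """Gradient descent with the delta rule: w_i += alpha * (t - y) * x_i."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)          # w[0] is the bias weight (x0 = 1)
    history = []
    for _ in range(epochs):
        for x, t in samples:
            xs = [1.0] + list(x)
            y = sum(wi * xi for wi, xi in zip(w, xs))   # linear activation
            w = [wi + alpha * (t - y) * xi for wi, xi in zip(w, xs)]
        history.append(total_error(w, samples))
    return w, history

# Hypothetical targets generated by t = 0.5 + 2*x1 - 1*x2.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1)]
samples = [(p, 0.5 + 2.0 * p[0] - 1.0 * p[1]) for p in points]
w, history = train_delta(samples)
print(w)                       # approaches [0.5, 2.0, -1.0]
print(history[0], history[-1])
```

Unlike the perceptron rule, the update uses the raw (unthresholded) output, so the weights keep moving down the error surface even when no example is misclassified.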
As the algorithm's name implies, the errors (and therefore the learning) propagate backwards from
the output nodes to the inner nodes. Technically speaking, backpropagation is used to calculate
the gradient of the error of the network with respect to the network's modifiable weights. This
gradient is almost always then used in a simple stochastic gradient descent algorithm (a general
optimization algorithm that is typically used to fit the parameters of a machine learning model) to
find weights that minimize the error. Often the term "backpropagation" is used in a more general
sense, to refer to the entire procedure encompassing both the calculation of the gradient and its use
in stochastic gradient descent. Backpropagation usually allows quick convergence on satisfactory
local minima for error in the kind of networks to which it is suited.
Backpropagation networks are necessarily multilayer perceptrons (usually with one input, one
hidden, and one output layer). In order for the hidden layer to serve any useful function, multilayer
networks must have non-linear activation functions for the multiple layers: a multilayer network
using only linear activation functions is equivalent to some single layer, linear network.
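The claim that stacked linear layers collapse into a single linear layer can be checked numerically. The sketch below uses two hypothetical weight matrices (values chosen arbitrarily, biases omitted for brevity) and compares the two-layer output with the output of the single layer whose weights are the matrix product.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden units (linear)
W2 = [[1.0, -2.0, 0.5]]                      # 3 hidden units -> 1 output (linear)
x = [2.0, -1.0]

two_layer = matvec(W2, matvec(W1, x))        # pass through both linear layers
single_layer = matvec(matmul(W2, W1), x)     # one layer with weights W2 * W1
print(two_layer, single_layer)               # the two results coincide
```

Because the hidden layer adds nothing a single weight matrix cannot express, a non-linear activation is needed for the hidden units to be useful.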
Backpropagation is the essence of neural network training. It is the practice of fine-tuning the
weights of a neural network based on the error (i.e. loss) obtained in the previous epoch (i.e.
iteration). Proper tuning of the weights lowers the error rate and, provided the network does not
overfit the training data, makes the model more reliable by improving its generalization.
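The backpropagation algorithm performs learning on a multilayer feed-forward neural
network. It iteratively learns a set of weights for prediction of the class label of tuples.
A multilayer feed-forward neural network consists of an input layer, one or more hidden
layers, and an output layer.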
Backpropagation learns by iteratively processing a data set of training tuples, comparing the
network’s prediction for each tuple with the actual known target value. The target value may be
the known class label of the training tuple (for classification problems) or a continuous value (for
numeric prediction). For each training tuple, the weights are modified so as to minimize the mean-
squared error between the network’s prediction and the actual target value. These modifications
are made in the “backwards” direction (i.e., from the output layer) through each hidden layer down
to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in
general the weights will eventually converge, and the learning process stops.
1. Initialization of weights
2. Feed forward
3. Back propagation of error
4. Updating of weights and biases
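Algorithm:
In the steps below, $x_i$ is the $i$th input, $z_j$ is the output of hidden unit $j$, $y_k$ and $t_k$ are the
actual and target outputs of output unit $k$, $v_{ij}$ is the weight from input $i$ to hidden unit $j$,
$w_{jk}$ is the weight from hidden unit $j$ to output unit $k$, $h$ is the weighted-sum input of a unit,
$g$ is the activation function and $\alpha$ is the learning rate.
Step 1: Feed the training sample through the network and determine the final output.
Step 2: Compute the error for each output unit; for unit $k$ it is:
$$\delta_k = (t_k - y_k) \, g'(h_k)$$
Step 3: Calculate the weight correction term for each output unit; for unit $k$ it is:
$$\Delta w_{jk} = \alpha \, \delta_k \, z_j$$
Step 4: Propagate the delta terms (errors) back through the weights of the hidden units, where the
delta input for the $j$th hidden unit is:
$$\delta^{in}_j = \sum_k \delta_k \, w_{jk}$$
Step 5: Calculate the weight correction term for the hidden units:
$$\delta_j = \delta^{in}_j \, g'(h_j), \qquad \Delta v_{ij} = \alpha \, \delta_j \, x_i$$
Step 6: Update the weights:
$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \qquad v_{ij} \leftarrow v_{ij} + \Delta v_{ij}$$
Step 7: Test for stopping (maximum cycles, small changes, etc.) and repeat from Step 1 if not
terminated.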
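The steps of the algorithm can be sketched in Python for a network with one hidden layer. This is an illustrative sketch under stated assumptions, not a definitive implementation: it assumes sigmoid activations (whose derivative is output × (1 - output)), a squared-error measure with a factor of one half, a constant first input of 1 standing in for the hidden-unit biases, no separate output bias, and hypothetical initial weights. The names `v`, `w`, `backprop_step` and `error` are introduced here only for the example.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def forward(v, w, x):
    """v[i][j]: weight input i -> hidden j; w[j][k]: weight hidden j -> output k."""
    z = [sigmoid(sum(x[i] * v[i][j] for i in range(len(x)))) for j in range(len(v[0]))]
    y = [sigmoid(sum(z[j] * w[j][k] for j in range(len(z)))) for k in range(len(w[0]))]
    return z, y

def error(v, w, x, t):
    _, y = forward(v, w, x)
    return sum(0.5 * (tk - yk) ** 2 for tk, yk in zip(t, y))

def backprop_step(v, w, x, t, alpha=0.5):
    """One pass of Steps 1-6 for a single training sample; returns new (v, w)."""
    z, y = forward(v, w, x)                                          # Step 1
    ks, js, idx = range(len(y)), range(len(z)), range(len(x))
    delta_out = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in ks]    # Step 2
    dw = [[alpha * delta_out[k] * z[j] for k in ks] for j in js]     # Step 3
    delta_in = [sum(delta_out[k] * w[j][k] for k in ks) for j in js] # Step 4
    delta_hid = [delta_in[j] * z[j] * (1.0 - z[j]) for j in js]      # Step 5
    dv = [[alpha * delta_hid[j] * x[i] for j in js] for i in idx]
    new_w = [[w[j][k] + dw[j][k] for k in ks] for j in js]           # Step 6
    new_v = [[v[i][j] + dv[i][j] for j in js] for i in idx]
    return new_v, new_w

# Hypothetical starting weights: 3 inputs (first is a constant 1), 2 hidden, 1 output.
v = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w = [[0.3], [-0.1]]
x, t = [1.0, 0.0, 1.0], [1.0]

initial_error = error(v, w, x, t)
for cycle in range(1000):                 # Step 7: stop on small error or max cycles
    if error(v, w, x, t) < 1e-3:
        break
    v, w = backprop_step(v, w, x, t)
final_error = error(v, w, x, t)
print(initial_error, final_error)
```

Note that the hidden-layer deltas are computed with the old output weights, before any weight is updated. A full training run would loop over all training samples in each cycle rather than a single one.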