ML Unit-IV
Artificial Neural Networks: Neurons and biological motivation, Linear threshold units.
Perceptrons: representational limitation and gradient descent training, Multilayer networks and
backpropagation, Hidden layers and constructing intermediate, distributed representations. Overfitting,
learning network structure, recurrent networks.
Support Vector Machines: Maximum margin linear separators. Quadratic programming solution to finding
maximum margin separators. Kernels for learning non-linear functions.
…………………………………………………………………………………………………………………………….
• A Neural Network is a system designed to operate like a human brain. Human information
processing takes place through the interaction of many billions of interconnected neurons that
send signals to one another.
• Similarly, a Neural Network is a network of artificial neurons, modeled on those found in the
human brain, for solving artificial intelligence problems such as image identification. It may be
a physical device or a mathematical construct.
• In other words, an Artificial Neural Network is a parallel computational system consisting of
many simple processing elements connected together to perform a particular task.
Biological Motivation
The motivation behind neural networks is the human brain. The human brain is often called the best
processor, even though it works more slowly than modern computers. Many researchers have therefore
tried to build machines that work on the same principles as the human brain.
The human brain contains billions of neurons, each connected to many other neurons to form a network,
so that when it sees an image, it recognizes the image and produces an output.
• Each neuron sums the signals it receives; when the sum reaches a threshold value, the neuron
fires and the signal travels down the axon to the other neurons.
• The amount of signal transmitted depends upon the strength of the connections.
• Connections can be inhibitory, i.e. decreasing signal strength, or excitatory, i.e. increasing signal strength.
In a similar manner, the idea arose to build interconnected artificial neurons, modeled on biological
neurons, making up an Artificial Neural Network (ANN). Each neuron is capable of taking a number of
inputs and producing an output.
Neurons in the human brain are capable of making very complex decisions, which means they run many
parallel processes for a particular task. One motivation for ANNs is to carry out a particular task,
such as identification, through many parallel processes.
Artificial Neuron
An artificial neuron is also called a perceptron. It consists of the following basic components:
• Input
• Weight
• Bias
• Activation Function
Below is an LTU with the activation function being the step function.
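As a minimal sketch in Python (the weights, bias, and inputs here are illustrative assumptions, not values from any real network):

```python
# A minimal linear threshold unit (LTU) with a step activation function.
def step(z):
    return 1 if z >= 0 else 0          # hard-limit / step activation

def ltu(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias, passed through the step function.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(z)

print(ltu([1, 0, 1], [0.5, -0.2, 0.8], bias=-1.0))  # 0.5 + 0.8 - 1.0 = 0.3 >= 0 -> 1
```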
Perceptron
The original Perceptron was designed to take a number of binary inputs, and produce one binary output (0
or 1).
The idea was to use different weights to represent the importance of each input, and to require that
the sum of the weighted inputs be greater than a threshold value before making a decision like true or false (0 or 1).
• Threshold = 1.5
• x1 * w1 = 1 * 0.7 = 0.7
• x2 * w2 = 0 * 0.6 = 0
• x3 * w3 = 1 * 0.5 = 0.5
• x4 * w4 = 0 * 0.3 = 0
• x5 * w5 = 1 * 0.4 = 0.4
• Sum = 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6; since 1.6 > 1.5, return true ("Yes I will go to the Concert").
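As a quick sketch, the same decision can be computed in a few lines of Python (values taken directly from the example above):

```python
# The concert example: inputs, weights, and threshold from the text.
inputs = [1, 0, 1, 0, 1]             # x1..x5
weights = [0.7, 0.6, 0.5, 0.3, 0.4]  # w1..w5
threshold = 1.5

total = sum(x * w for x, w in zip(inputs, weights))  # 0.7 + 0.5 + 0.4 = 1.6
print(total > threshold)             # True: "Yes I will go to the Concert"
```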
The perceptron has two main limitations:
1. The output of a perceptron can only be a binary number (0 or 1) due to the hard-limit transfer
function.
2. It can only be used to classify linearly separable sets of input vectors. If the input vectors are
not linearly separable, the perceptron cannot classify them correctly.
Gradient Descent is one of the most commonly used optimization algorithms for training machine
learning models; it works by minimizing the error between actual and predicted results. Gradient descent
is also used to train Neural Networks. It helps in finding the local minimum of a function.
Using the gradient, the local minimum and local maximum of a function can be found as follows:
• If we move in the direction of the negative gradient of the function at the current point (away from
the gradient), we approach the local minimum of that function.
• If we move in the direction of the positive gradient of the function at the current point (towards
the gradient), we approach the local maximum of that function.
The latter procedure is known as Gradient Ascent; moving against the gradient is Gradient Descent, also
known as steepest descent. The main objective of the gradient descent algorithm is to minimize the cost
function through iteration. To achieve this goal, it performs two steps iteratively:
• Calculate the first-order derivative of the function to compute the gradient, or slope, at the current point.
• Move away from the direction of the gradient, stepping from the current point by alpha times the
gradient, where alpha is defined as the Learning Rate. It is a tuning parameter in the optimization
process that decides the length of the steps.
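Concretely, each iteration applies the generic update rule (writing θ for the parameters being learned; this is the standard formulation rather than anything specific to this text):

θ_new = θ_old − α · ∇J(θ)

where J(θ) is the cost function and α is the learning rate.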
The cost function measures the difference, or error, between the actual and predicted values at the
current position, expressed as a single real number.
Before examining the working principle of gradient descent, we should recall how to find the slope of
a line in linear regression. The equation for simple linear regression is:
Y = mX + c
where 'm' represents the slope of the line and 'c' represents the intercept on the y-axis.
The starting point is just an arbitrary point used to evaluate the initial performance. At this starting
point, we compute the first derivative, i.e. the slope of the tangent line, to measure the steepness at
that point. This slope then informs the updates to the parameters (weights and bias).
The slope is steep at the starting point, but as new parameters are generated the steepness gradually
reduces, until the algorithm approaches the lowest point of the curve, which is called the point of
convergence.
The main objective of gradient descent is to minimize the cost function, i.e. the error between the
expected and actual values. To minimize the cost function, two things are required: the direction of
movement (the negative gradient) and the learning rate (the step size).
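As a small illustration, here is a sketch of gradient descent fitting Y = mX + c by minimizing the mean squared error cost; the data points, learning rate, and iteration count are assumptions chosen for the example:

```python
import numpy as np

# Toy data, roughly Y = 2X + 1 (assumed values for illustration).
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.1, 4.9, 7.2, 8.8])

m, c = 0.0, 0.0        # arbitrary starting point
alpha = 0.01           # learning rate (step length)

for _ in range(5000):
    error = (m * X + c) - Y
    grad_m = 2 * np.mean(error * X)   # dJ/dm for J = mean((mX + c - Y)^2)
    grad_c = 2 * np.mean(error)       # dJ/dc
    m -= alpha * grad_m               # step against the gradient
    c -= alpha * grad_c

print(m, c)            # approaches the true slope and intercept
```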
In a multilayer network, each and every node in the nth layer is connected to each and every node in the
(n−1)th layer (n > 1). The input from the input layer is multiplied by the associated weights of every
link and traverses the network until the output layer produces the final output. In case of any error,
unlike with the perceptron, we may need to update several weight vectors across many hidden layers. This
is where Backpropagation comes into play. It is simply the updating of the weight vectors in the hidden
layers according to the training error, or loss, produced at the output layer.
Here we consider multiple output units rather than a single output unit as discussed earlier. Therefore,
the training error for such a neural network can be written as:
E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²
• t_kd and o_kd are the target value and the output value produced by the network for the kth output
unit on training example 'd', and D is the set of training examples.
Now that we have the error function and the input and output units, we need the rule for updating the
weight vectors. Before that, let's look at one of the most common activation functions used in multilayer
neural networks: the sigmoid function.
The sigmoid function is a continuously differentiable function that produces output in the range of 0
to 1 (not including 0 and 1). It can be represented as:
σ(y) = 1 / (1 + e^(−y))
where y is the linear combination of the input vector and the weight vector at a given node. (The
hyperbolic tangent, tanh, is a related squashing function, but its output lies in the range −1 to 1.)
Now, let’s see how the weight vectors are updated in multilayer networks according to the Backpropagation
algorithm.
• Input the first training example into the network and calculate the output o_u for every unit u.
• For each output unit 'k', the training error '𝛿' can be calculated by:
𝛿_k = o_k (1 − o_k) (t_k − o_k)
• For each hidden unit 'h', the error term is computed from the errors of the units it feeds into:
𝛿_h = o_h (1 − o_h) Σ_{k∈outputs} w_kh 𝛿_k
• The weight on the link from the ith node to the jth node is then updated as:
w_ji ← w_ji + η 𝛿_j x_ji
in which 'η' is the learning rate, '𝛿_j' is the training error of node j, and 'x_ji' is the input to
node j from node i.
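To make the update rule concrete, here is a minimal sketch of one backpropagation step on a tiny 2-3-2 network; the layer sizes, inputs, targets, and learning rate are assumptions for illustration:

```python
import numpy as np

# One backpropagation update for a single training example.
rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])             # input vector
t = np.array([1.0, 0.0])             # target values for the output units
W1 = rng.uniform(-0.5, 0.5, (2, 3))  # input -> hidden weights
W2 = rng.uniform(-0.5, 0.5, (3, 2))  # hidden -> output weights
eta = 0.1                            # learning rate

sigmoid = lambda y: 1.0 / (1.0 + np.exp(-y))

h = sigmoid(x @ W1)                  # hidden unit outputs
o = sigmoid(h @ W2)                  # output unit outputs

delta_k = o * (1 - o) * (t - o)          # error terms for output units
delta_h = h * (1 - h) * (W2 @ delta_k)   # error terms for hidden units

W2 += eta * np.outer(h, delta_k)     # w_ji <- w_ji + eta * delta_j * x_ji
W1 += eta * np.outer(x, delta_h)
```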
The above algorithm is applied repeatedly to all data points until a termination criterion is met, which
can be specified in one of three ways:
• Running the algorithm for a fixed number of iterations (epochs).
• Setting a threshold on the error: if the error goes below the given threshold, we can stop training
the neural network further.
• Creating a validation sample of data: after every iteration we validate our model on this data, and
the iteration with the highest validation accuracy is taken as the final model.
The first way of termination might not yield the best results; the most recommended is the third way,
since with it we know the accuracy of our model so far.
Consider, for example, the network shown in Figure 4.7. Here, the eight network inputs are connected to
three hidden units, which are in turn connected to the eight output units. Because of this structure, the three
hidden units will be forced to re-represent the eight input values in some way that captures their relevant
features, so that this hidden layer representation can be used by the output units to compute the correct target
values.
When BACKPROPAGATION is applied to this task (learning the identity function, so the target output
equals the input), using each of the eight possible one-hot vectors as training examples, it successfully
learns the target function. What hidden layer representation is created by the gradient descent
BACKPROPAGATION algorithm? By examining the hidden unit values generated by the learned network for each
of the eight possible input vectors, it is easy to see that the learned encoding is similar to the
familiar standard binary encoding of eight values using three bits (e.g., 000, 001, 010, ..., 111). The
exact values of the hidden units for one typical run of BACKPROPAGATION are shown in Figure 4.7.
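A compact sketch of this 8-3-8 experiment follows; the learning rate, epoch count, and initialization are assumptions, and Mitchell's exact settings may differ:

```python
import numpy as np

# 8-3-8 encoder: learn the identity function over eight one-hot vectors.
rng = np.random.default_rng(0)
X = np.eye(8)                          # the eight one-hot training vectors
W1 = rng.uniform(-0.5, 0.5, (8, 3))    # input -> hidden weights
b1 = np.zeros(3)
W2 = rng.uniform(-0.5, 0.5, (3, 8))    # hidden -> output weights
b2 = np.zeros(8)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

eta = 0.3                              # learning rate (assumed)
for epoch in range(5000):
    for x in X:                        # one stochastic update per example
        h = sigmoid(x @ W1 + b1)       # hidden activations
        o = sigmoid(h @ W2 + b2)       # output activations
        delta_o = o * (1 - o) * (x - o)          # output error terms
        delta_h = h * (1 - h) * (W2 @ delta_o)   # hidden error terms
        W2 += eta * np.outer(h, delta_o)
        b2 += eta * delta_o
        W1 += eta * np.outer(x, delta_h)
        b1 += eta * delta_h

# The rounded hidden activations approximate a 3-bit code for the 8 inputs.
print(np.round(sigmoid(X @ W1 + b1)))
```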
This ability of multilayer networks to automatically discover useful representations at the hidden layers is a key
feature of ANN learning.
“Distributed representation” means a many-to-many relationship between two types of representation (such
as concepts and neurons).
In Overfitting, the model tries to learn too many details in the training data along with the noise from the
training data. As a result, the model performance is very poor on unseen or test datasets. Therefore, the
network fails to generalize the features or patterns present in the training dataset.
Overfitting during training can be spotted when the error on training data decreases to a very small value but
the error on the new data or test data increases to a large value.
When the network tries to learn from a small dataset, it tends to have greater control over the dataset
and will try to satisfy all the data points exactly. So the network ends up memorizing every single data
point and failing to capture the general trend of the training dataset.
Overfitting also occurs when the model tries to make predictions on data that is very noisy, which is
often caused by an overly complex model having too many parameters. Due to this, the overfitted model is
inaccurate, as the learned trend does not reflect the reality present in the data.
Deep neural networks are prone to overfitting because they learn millions or billions of parameters while
building the model. A model having this many parameters can overfit the training data because it has sufficient
capacity to do so.
The basic idea to deal with the problem of overfitting is to decrease the complexity of the model. To do so,
we can make the network smaller by simply removing the layers or reducing the number of neurons, etc.
How does overfitting get reduced when we remove layers or reduce the number of neurons?
By removing some layers or reducing the number of neurons, the network becomes less prone to overfitting,
as the neurons contributing to overfitting are removed or deactivated. The network then has a smaller
number of parameters to learn, so it cannot memorize all the data points and will be forced to
generalize.
But while using this technique, one has to keep in mind to recompute the input and output dimensions of
the various layers involved in the neural network.
There is no rule of thumb for deciding how many layers or neurons to remove, but there are some popular
approaches, described below:
• Grid Search: apply grid-search cross-validation to find the number of neurons or layers to keep, as
sketched after this list.
• Trimming: we can also prune our overfitted model by removing nodes or connections after model building
until it reaches suitable performance on unseen datasets.
So, in simple words, the aim of this technique is to make the neural network smaller to prevent it from
overfitting.
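One possible way to carry out the grid search mentioned above is with scikit-learn; the dataset and candidate layer sizes here are assumptions, while MLPClassifier and GridSearchCV are standard scikit-learn APIs:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Candidate architectures: fewer/more neurons, one vs. two hidden layers.
param_grid = {"hidden_layer_sizes": [(8,), (32,), (64,), (32, 16)]}

search = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=0),
    param_grid,
    cv=5,                          # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_)         # architecture with the best CV score
```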
Recurrent networks:
Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is
fed as input to the current step. In traditional neural networks, all the inputs and outputs are
independent of each other; but in cases such as predicting the next word of a sentence, the previous
words are required, and hence there is a need to remember them. Thus RNNs came into existence, solving
this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden
state, which remembers some information about a sequence.
The working of an RNN can be understood with the help of the following example:
Example:
Suppose there is a deeper network with one input layer, three hidden layers, and one output layer. Then,
like other neural networks, each hidden layer will have its own set of weights and biases; say the
weights and biases are (w1, b1) for hidden layer 1, (w2, b2) for the second hidden layer, and (w3, b3)
for the third hidden layer. This means that each of these layers is independent of the others, i.e. they
do not memorize the previous outputs.
• An RNN converts these independent activations into dependent activations by providing the same weights
and biases to all the layers, thus reducing the complexity of having more parameters, and it memorizes
each previous output by giving it as input to the next hidden step.
• Hence these three layers can be joined together, with the weights and biases of all the hidden layers
being the same, into a single recurrent layer, as sketched below.
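A minimal sketch of this weight sharing (the sizes and inputs are assumptions): the same W_x, W_h, and b are reused at every time step, and the hidden state h carries the memory:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 4, 3, 5
W_x = rng.standard_normal((hidden_size, input_size)) * 0.1  # input weights
W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1 # recurrent weights
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)            # hidden state: the network's "memory"
for t in range(steps):
    x_t = rng.standard_normal(input_size)    # input at time step t
    h = np.tanh(W_x @ x_t + W_h @ h + b)     # same weights at every step
print(h)
```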
……………………………………………………………………………………………………………………………..
“The support vector machine (SVM) is a supervised learning method that generates input-output mapping
functions from a set of labeled training data." A Support Vector Machine (SVM) performs classification by
finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the
hyperplane are the support vectors.
Algorithm:
1. Define an optimal hyperplane: maximize its margin.
2. Extend the above definition to non-linearly separable problems: add a penalty term for
misclassifications.
3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces:
reformulate the problem so that the data is mapped implicitly to this space (the kernel trick).
To define an optimal hyperplane, we need to maximize the width of the margin, which is 2/||w||, where w
is the weight vector of the hyperplane.
The beauty of SVM is that if the data is linearly separable, there is a unique global minimum value. An
ideal SVM analysis should produce a hyperplane that completely separates the vectors (cases) into two
non-overlapping classes. However, perfect separation may not be possible, or it may require a model so
complex that it fails to classify new cases correctly. In this situation, SVM finds the hyperplane that
maximizes the margin while minimizing the misclassifications.
For the maximum margin hyperplane, only the examples on the margin matter (only these affect the
distances). These are called support vectors. The objective of the support vector machine algorithm is
to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly
classifies the data points.
Hyperplanes
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the
hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the
number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of
input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine
when the number of features exceeds 3.
Support Vectors
Support vectors are the data points that lie closest to the hyperplane and influence its position and
orientation. Using these support vectors, we maximize the margin of the classifier.
vectors will change the position of the hyperplane. These are the points that help us build our SVM. It will be
useful computationally if only a small fraction of the datapoints are support vectors, because we use the
support vectors to decide which side of the separator a test case is on.
To find the maximum margin separator, we have to solve the following optimization problem:
w · x_c + b ≥ +1 for all positive examples x_c
w · x_c + b ≤ −1 for all negative examples x_c
and ||w||² is as small as possible.
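As an illustration, a linear maximum-margin separator can be fit with scikit-learn's SVC; the toy dataset and the choice of C are assumptions, with a large C approximating the hard-margin problem above:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters standing in for a linearly separable dataset.
X, y = make_blobs(n_samples=60, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1e3)    # large C approximates a hard margin
clf.fit(X, y)

print(clf.support_vectors_)          # the points that define the margin
print(clf.coef_, clf.intercept_)     # w and b of the separating hyperplane
```

Replacing kernel="linear" with, for example, kernel="rbf" applies the kernel trick from step 3 of the algorithm above, allowing non-linear decision surfaces to be learned.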