ML Unit-Iv


UNIT IV

Artificial Neural Networks: Neurons and biological motivation, Linear threshold units.

Perceptrons: representational limitation and gradient descent training, Multilayer networks and
backpropagation, Hidden layers and constructing intermediate, distributed representations. Overfitting,
learning network structure, recurrent networks.

Support Vector Machines: Maximum margin linear separators. Quadratic programming solution to finding
maximum margin separators. Kernels for learning non-linear functions.

…………………………………………………………………………………………………………………………….

1. Neurons and biological motivation


What is an Artificial Neural Network?

• A neural network is a system designed to operate loosely like the human brain. Human information
processing takes place through the interaction of many billions of neurons, each connected to and
sending signals to other neurons.

• Similarly, an artificial neural network is a network of artificial neurons, modeled on those found in human
brains, for solving artificial intelligence problems such as image identification. It may be a physical device or
a mathematical construct.

• In other words, an Artificial Neural Network is a parallel computational system consisting of many
simple processing elements connected to perform a particular task.

Biological Motivation

The motivation behind neural networks is the human brain. The human brain is often called the best processor
even though it works more slowly than computers. Many researchers set out to build a machine that would
work in the manner of the human brain.
The human brain contains billions of neurons, each connected to many other neurons to form a network, so
that when it sees an image, it recognizes the image and produces an output.

1 | © www.tutorialtpoint.net Prepared By D.Venkata Reddy M.Tech(Ph.D),UGC NET, AP SET Qualified


• Dendrites receive signals from other neurons.

• The cell body sums the incoming signals to generate the input.

• When the sum reaches a threshold value, the neuron fires and the signal travels down the axon to
other neurons.

• The amount of signal transmitted depends upon the strength of the connections.

• Connections can be inhibitory (decreasing the signal strength) or excitatory (increasing the signal strength).

In a similar manner, the idea was to build artificial interconnected neurons, like biological neurons,
making up an Artificial Neural Network (ANN). Each biological neuron is capable of taking a number of
inputs and producing an output.
Neurons in the human brain are capable of making very complex decisions, which means they run many parallel
processes for a particular task. One motivation for ANNs is to carry out a particular task, such as
identification, through many parallel processes.

2. Structure of Neural Network / Linear threshold units

Artificial Neuron

Artificial neurons are often also called perceptrons. An artificial neuron consists of the following basic parts:
• Input
• Weight
• Bias
• Activation Function

• Output
The artificial neuron, also called a linear threshold unit (LTU), was introduced by McCulloch and Pitts in 1943:
given one or more numeric inputs, it computes a weighted sum of them, applies an activation function, and
outputs the result.

Common activation functions: step function and sigmoid function

Below is an LTU with the activation function being the step function.
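
As a rough illustration in code, an LTU with a step activation might be sketched as follows. The function names, the bias formulation and the example weights are assumptions made for this sketch; they are chosen so that the unit behaves like a logical AND.

```python
def step(z):
    """Step activation: output 1 when z is positive, otherwise 0."""
    return 1 if z > 0 else 0


def ltu(inputs, weights, bias=0.0):
    """Linear threshold unit: weighted sum of the inputs plus a bias, passed through a step."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(z)


# Hypothetical weights and bias: the unit fires only when both inputs are 1.
print(ltu([1, 1], [1.0, 1.0], bias=-1.5))  # 1  (1 + 1 - 1.5 = 0.5 > 0)
print(ltu([1, 0], [1.0, 1.0], bias=-1.5))  # 0  (1 - 1.5 = -0.5)
```

Using a bias of -1.5 is equivalent to comparing the weighted sum against a threshold of 1.5.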

3. Perceptron
The original Perceptron was designed to take a number of binary inputs, and produce one binary output (0
or 1).

The idea was to use different weights to represent the importance of each input, and to require that the
weighted sum of the inputs be greater than a threshold value before the output becomes true (1) rather than false (0).



Perceptron Example

Imagine a perceptron (in your brain).

The perceptron tries to decide if you should go to a concert.

Is the artist good? Is the weather good?

What weights should these facts have?

Criteria             Input          Weight
Artist is Good       x1 = 0 or 1    w1 = 0.7
Weather is Good      x2 = 0 or 1    w2 = 0.6
Friend will Come     x3 = 0 or 1    w3 = 0.5
Food is Served       x4 = 0 or 1    w4 = 0.3
Alcohol is Served    x5 = 0 or 1    w5 = 0.4

The Perceptron Algorithm

Frank Rosenblatt suggested this algorithm:

1. Set a threshold value

2. Multiply each input by its weight

3. Sum all the results

4. Activate the output

1. Set a threshold value:

• Threshold = 1.5

2. Multiply each input by its weight:

• x1 * w1 = 1 * 0.7 = 0.7

• x2 * w2 = 0 * 0.6 = 0

• x3 * w3 = 1 * 0.5 = 0.5

• x4 * w4 = 0 * 0.3 = 0

• x5 * w5 = 1 * 0.4 = 0.4

3. Sum all the results:

• 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (The Weighted Sum)



4. Activate the Output:

• Return true if the sum > 1.5 ("Yes I will go to the Concert"). Here 1.6 > 1.5, so the output is true.
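
The four steps of the concert example can be sketched in a few lines of Python. The dictionary keys and the function name are hypothetical labels introduced for this sketch; the weights, inputs and threshold are the ones used in the example.

```python
# Weights from the concert example (importance of each criterion).
weights = {
    "artist_good": 0.7,
    "weather_good": 0.6,
    "friend_comes": 0.5,
    "food_served": 0.3,
    "alcohol_served": 0.4,
}

# Inputs from the worked example: x1 = 1, x2 = 0, x3 = 1, x4 = 0, x5 = 1.
inputs = {
    "artist_good": 1,
    "weather_good": 0,
    "friend_comes": 1,
    "food_served": 0,
    "alcohol_served": 1,
}

THRESHOLD = 1.5  # step 1: set a threshold value


def perceptron(inputs, weights, threshold):
    """Steps 2-4: multiply inputs by weights, sum the results, compare with the threshold."""
    weighted_sum = sum(inputs[name] * weights[name] for name in weights)
    return weighted_sum, weighted_sum > threshold


total, go_to_concert = perceptron(inputs, weights, THRESHOLD)
print(round(total, 2), go_to_concert)  # 1.6 True
```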

3.1 Representational limitation and gradient descent training

The following are the limitations of a Perceptron model:

1. The output of a perceptron can only be a binary number (0 or 1) due to the hard-edge transfer
function.

2. It can only be used to classify linearly separable sets of input vectors. If the input vectors are not
linearly separable (the XOR function is the classic example), a single perceptron cannot classify them all correctly.
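
The second limitation can be illustrated with a small brute-force search. This is only a sketch under stated assumptions: the weight grid, the function names and the truth tables are chosen for illustration, and a search over a finite grid demonstrates the limitation rather than proving it.

```python
from itertools import product


def ltu(x1, x2, w1, w2, theta):
    """Single threshold unit with two inputs: fires when the weighted sum exceeds theta."""
    return 1 if w1 * x1 + w2 * x2 > theta else 0


def find_ltu(truth_table):
    """Search a coarse grid of weights and thresholds for a unit matching the truth table."""
    grid = [i / 2 for i in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5
    for w1, w2, theta in product(grid, repeat=3):
        if all(ltu(x1, x2, w1, w2, theta) == y for (x1, x2), y in truth_table.items()):
            return w1, w2, theta
    return None


AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

print("AND:", find_ltu(AND))  # some (w1, w2, theta) triple is found
print("XOR:", find_ltu(XOR))  # None: no single unit on this grid computes XOR
```

AND is linearly separable, so the search succeeds; XOR is not, so no combination of two weights and a threshold reproduces it.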

Gradient Descent is known as one of the most commonly used optimization algorithms to train machine
learning models by means of minimizing errors between actual and expected results. Further, gradient descent
is also used to train Neural Networks. It helps in finding the local minimum of a function.

Moving along the gradient leads to a local minimum or a local maximum of a function as follows:

• If we move in the direction of the negative gradient, i.e. away from the gradient of the function at the
current point, we move towards a local minimum of that function. This procedure is known as Gradient
Descent, which is also known as steepest descent.

• If we move in the direction of the positive gradient, i.e. towards the gradient of the function at the current
point, we move towards a local maximum of that function. This procedure is known as Gradient Ascent.

The main objective of using a gradient descent algorithm is to minimize the cost function by iteration. To
achieve this goal, it performs two steps iteratively:

• Calculate the first-order derivative of the function to obtain its gradient (slope) at the current point.

• Move in the direction opposite to the gradient, i.e. step from the current point by alpha times the
gradient, where alpha is the Learning Rate. It is a tuning parameter of the optimization
process that determines the length of the steps.
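
The two steps above can be sketched as a short loop. The example function f(x) = (x - 3)^2, the starting point, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
def gradient_descent(grad, start, alpha=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate alpha."""
    x = start
    for _ in range(steps):
        x = x - alpha * grad(x)  # move opposite to the gradient
    return x


# Illustrative cost: f(x) = (x - 3)^2, whose derivative is f'(x) = 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(x_min)  # very close to 3, the minimum of f
```

With a learning rate that is too large (for this function, alpha above 1.0) the iterates overshoot and diverge, which is why alpha is treated as a tuning parameter.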



What is Cost-function?

The cost function is defined as a measure of the difference, or error, between actual values and expected
values at the current position, expressed as a single real number.

How does Gradient Descent work?

Before starting the working principle of gradient descent, we should know some basic concepts to find out the
slope of a line from linear regression. The equation for simple linear regression is given as:

Y=mX+c

Where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.

The starting point (shown in the figure above) is just an arbitrary point used to begin evaluating performance.
At this starting point, we compute the first derivative, i.e. the slope of the tangent line, which measures the
steepness of the cost curve there. This slope then informs the updates to the parameters (weights and
bias).

The slope is steep at the arbitrary starting point, but as new parameters are generated the steepness
gradually reduces until the curve reaches its lowest point, which is called the point of convergence.

The main objective of gradient descent is to minimize the cost function, i.e. the error between expected and
actual values. To minimize the cost function, two things are required: a direction (given by the gradient) and
a learning rate (the step size).
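
As a minimal sketch of how this applies to the line Y = mX + c, the loop below fits m and c by gradient descent. The mean squared error cost, the toy data, the learning rate and the number of epochs are assumptions made for this example.

```python
def fit_line(xs, ys, alpha=0.05, epochs=2000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0  # arbitrary starting point
    n = len(xs)
    for _ in range(epochs):
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))  # d(cost)/dm
        grad_c = (2 / n) * sum(errors)                             # d(cost)/dc
        m -= alpha * grad_m
        c -= alpha * grad_c
    return m, c


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
print(round(m, 3), round(c, 3))  # approximately 2.0 and 1.0
```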

3.2 Multilayer networks and backpropagation



First, let us see what a multilayer neural network looks like. In a multilayer neural network, there is
one input layer, one output layer and one or more hidden layers.

Representation of a Multi Layer Neural Network

Every node in the nth layer is connected to every node in the (n-1)th layer (n>1). The input from the
input layer is multiplied by the weights associated with each link and propagated forward, layer by layer, until
the output layer produces the final output. In case of an error, unlike the perceptron, we may need to
update several weight vectors in many hidden layers. This is where Back Propagation comes in. It is
nothing but the updating of the weight vectors in the hidden layers according to the training error, or loss,
produced at the output layer.
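
The forward pass described above can be sketched as follows. The sigmoid activation, the 2-2-1 layout and the weight values are hypothetical choices made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """weights[j][i] is the weight on the link from input i to unit j of this layer."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Propagate the inputs through every layer in turn and return the final outputs."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical 2-2-1 network: two inputs, one hidden layer of two units, one output unit.
layers = [
    ([[0.5, -0.5], [0.3, 0.8]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                   # output layer
]
output = forward([1.0, 0.0], layers)
print(output)  # a single value strictly between 0 and 1
```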

BACK PROPAGATION ALGORITHM

Here we consider multiple output units rather than a single output unit. The training error of the network
can therefore be written as a sum over all output units and all data points:

$$ E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 $$

Error function in multi-layer neural networks

• outputs is the set of output units in the network

• D is the set of training data points and d is a single data point

• t_kd and o_kd are the target value and the output value produced by the network for the kth output
unit for data point d.

Now that we have the error function and the input and output units, we need the rule for updating the weight
vectors. Before that, let us look at one of the most common activation functions used in multilayer neural
networks, i.e. the sigmoid function.

A sigmoid function is a smooth, continuously differentiable, S-shaped function. The logistic sigmoid used here
produces an output in the range 0 to 1 (not including 0 and 1); the hyperbolic tangent (tanh) is a related
S-shaped function whose output lies between -1 and 1. The logistic sigmoid can be represented as:

$$ \sigma(y) = \frac{1}{1 + e^{-y}} $$

Sigmoid Function

where y is the linear combination of the input vector and the weight vector at a given node.
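
A minimal sketch of the logistic sigmoid and its derivative follows. The derivative identity sigma'(y) = sigma(y)(1 - sigma(y)) is a standard property of the logistic function; the function names are chosen for this example.

```python
import math


def sigmoid(y):
    """Logistic sigmoid: maps a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


def sigmoid_derivative(y):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(y)
    return s * (1.0 - s)


for y in (-2.0, 0.0, 2.0):
    print(y, round(sigmoid(y), 4), round(sigmoid_derivative(y), 4))
```

The output is 0.5 at y = 0 and the slope is largest there (0.25). This simple form is fine for moderate inputs; for very large negative y, `math.exp(-y)` can overflow.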

Now, let us see how the weight vectors are updated in multilayer networks according to the Back Propagation
Algorithm.

Updating weights in Back Propagation

The algorithm can be represented in a step-wise manner:

• Input the first data point into the network and calculate the output of every unit u in the network;
call this output o_u.

• For each output unit k, the training error term δ_k can be calculated by the given formula, where t_k is
the target value for unit k:

$$ \delta_k = o_k (1 - o_k)(t_k - o_k) $$



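• For each hidden unit h, the training error term δ_h can be calculated by the given formula, which takes
into account the error terms δ_k of the output units to which the hidden unit is connected (w_kh is the
weight from hidden unit h to output unit k):

$$ \delta_h = o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \, \delta_k $$

• Update the weights by the given formula:

$$ w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \Delta w_{ji} = \eta \, \delta_j \, x_{ji} $$

• Here w_ji is the weight on the link from node i to node j, η is the learning rate, δ_j is the training error
term of node j and x_ji is the input that node j receives from node i.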
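
The update steps can be sketched for a network with one hidden layer of sigmoid units. This is a sketch under assumptions rather than a definitive implementation: it uses the standard error terms for sigmoid units with squared error, delta_k = o_k(1 - o_k)(t_k - o_k) and delta_h = o_h(1 - o_h) * sum_k w_kh delta_k, and the network size, random seed, learning rate and single training example are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Each weight row ends with a bias weight, so a constant 1.0 is appended to the inputs."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    o = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in w_out]
    return h, o


def backprop_step(x, t, w_hidden, w_out, eta=0.5):
    """One weight update for a single data point (x, t); modifies the weights in place."""
    h, o = forward(x, w_hidden, w_out)
    # Error term of each output unit k.
    delta_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Error term of each hidden unit, using the output errors it feeds into.
    delta_hid = [
        hj * (1 - hj) * sum(w_out[k][j] * delta_out[k] for k in range(len(o)))
        for j, hj in enumerate(h)
    ]
    # w_ji <- w_ji + eta * delta_j * x_ji for output-layer and hidden-layer weights.
    for k, row in enumerate(w_out):
        for j, v in enumerate(h + [1.0]):
            row[j] += eta * delta_out[k] * v
    for j, row in enumerate(w_hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] += eta * delta_hid[j] * v


def squared_error(x, t, w_hidden, w_out):
    _, o = forward(x, w_hidden, w_out)
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))


random.seed(0)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]  # 2 inputs + bias -> 2 hidden
w_out = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(1)]     # 2 hidden + bias -> 1 output

x, t = [1.0, 0.0], [1.0]  # hypothetical training example
before = squared_error(x, t, w_hidden, w_out)
for _ in range(200):
    backprop_step(x, t, w_hidden, w_out)
after = squared_error(x, t, w_hidden, w_out)
print(before, after)  # the error on this example shrinks as the weights are updated
```

Note that the hidden-unit error terms are computed before any weight is changed, so both layers are updated from the same forward pass.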

Termination Criterion for Multi layer networks

The above algorithm is applied repeatedly to all data points until a termination criterion is met. The
criterion can be specified in any of these three ways:

• Training the network for a fixed number of epochs (iterations).

• Setting an error threshold: if the error goes below the given threshold, we stop training the
neural network.

• Creating a validation sample of data: after every iteration we validate the model on this data, and
the iteration with the highest accuracy is taken as the final model.

The first criterion might not yield the best results. The third is the most commonly recommended, because it
tracks the accuracy of the model on held-out data throughout training.
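
The three criteria might be expressed in code roughly as follows. The helper names, the default limits and the accuracy values are hypothetical and serve only to illustrate the idea.

```python
def should_stop(epoch, error, max_epochs=100, error_threshold=0.01):
    """Criteria 1 and 2: stop after a fixed number of epochs or once the error is small enough."""
    return epoch >= max_epochs or error < error_threshold


def best_epoch(validation_accuracy):
    """Criterion 3: pick the epoch whose model scored highest on the validation sample."""
    best = max(range(len(validation_accuracy)), key=lambda i: validation_accuracy[i])
    return best, validation_accuracy[best]


# Hypothetical validation accuracies recorded after each epoch.
accuracies = [0.62, 0.71, 0.78, 0.80, 0.79, 0.77]
print(best_epoch(accuracies))  # (3, 0.8): accuracy peaks at epoch 3, then declines
print(should_stop(epoch=10, error=0.005))  # True: the error is below the threshold
```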

3.3 Hidden layers and constructing intermediate representations


One intriguing property of BACKPROPAGATION is its ability to discover useful intermediate representations
at the hidden unit layers inside the network. Because training examples constrain only the network inputs and
outputs, the weight-tuning procedure is free to set weights that define whatever hidden unit representation is
most effective at minimizing the squared error E. This can lead BACKPROPAGATION to define new hidden
layer features that are not explicit in the input representation, but which capture properties of the input
instances that are most relevant to learning the target function.

Consider, for example, the network shown in Figure 4.7. Here, the eight network inputs are connected to
three hidden units, which are in turn connected to the eight output units. Because of this structure, the three
hidden units will be forced to re-represent the eight input values in some way that captures their relevant
features, so that this hidden layer representation can be used by the output units to compute the correct target
values.



Consider training the network shown in Figure 4.7 to learn the simple target function f (x) = x, where x is a
vector containing seven 0's and a single 1. The network must learn to reproduce the eight inputs at the
corresponding eight output units. Although this is a simple function, the network in this case is constrained to
use only three hidden units. Therefore, the essential information from all eight input units must be captured
by the three learned hidden units.

When BACKPROPAGATION is applied to this task, using each of the eight possible vectors as training
examples, it successfully learns the target function. What hidden layer representation is created by the gradient
descent BACKPROPAGATION algorithm? By examining the hidden unit values generated by the learned
network for each of the eight possible input vectors, it is easy to see that the learned encoding is similar to the
familiar standard binary encoding of eight values using three bits (e.g., 000,001,010,. . . , 111). The exact values
of the hidden units for one typical run of BACKPROPAGATION are shown in Figure 4.7.
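
To see why three hidden units can be enough, the sketch below lists the eight one-hot input vectors next to a standard 3-bit binary code. This is only an illustration of the idea: the hidden values a trained network actually learns are real numbers and need not match this exact code or ordering. The function names are chosen for this example.

```python
def one_hot(index, size=8):
    """Input vector containing seven 0's and a single 1 at the given position."""
    return [1 if i == index else 0 for i in range(size)]


def binary_code(index, bits=3):
    """Standard binary encoding of the index, most significant bit first."""
    return [(index >> b) & 1 for b in reversed(range(bits))]


for i in range(8):
    print(one_hot(i), "->", binary_code(i))

# All eight inputs receive distinct 3-bit codes, so three units can tell them apart.
codes = {tuple(binary_code(i)) for i in range(8)}
print(len(codes))  # 8
```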

This ability of multilayer networks to automatically discover useful representations at the hidden layers is a key
feature of ANN learning.

3.4 distributed representations

In a local representation, each concept is represented by its own dedicated neuron.

“Distributed representation” means a many-to-many relationship between two types of representation (such
as concepts and neurons).

– Each concept is represented by many neurons

– Each neuron participates in the representation of many concepts

3.5 Overfitting, learning network structure, recurrent networks


Overfitting:



The goal of deep learning models is to generalize well from the training data to any data from the
problem domain. This is crucial because we want the model to make predictions on unseen data, i.e.
data it has never seen before.

In Overfitting, the model tries to learn too many details in the training data along with the noise from the
training data. As a result, the model performance is very poor on unseen or test datasets. Therefore, the
network fails to generalize the features or patterns present in the training dataset.

Overfitting during training can be spotted when the error on training data decreases to a very small value but
the error on the new data or test data increases to a large value.

Reasons for Overfitting

The possible reasons for Overfitting in neural networks are as follows:

The size of the training dataset is small

When the network learns from a small dataset, it has enough capacity to fit every data point exactly.
The network then memorizes each individual data point and fails to capture the general trend in the
training dataset.

The model tries to fit Noisy Data

Overfitting also occurs when an overly complex model with too many parameters is fitted to data that is very
noisy. The overfitted model is then inaccurate, because the trend it has learned does not reflect the real
pattern present in the data.

Reduce the Model Complexity

Let us first understand:

Why are Deep Neural Networks prone to Overfitting?

Deep neural networks are prone to overfitting because they learn millions or billions of parameters while
building the model. A model having this many parameters can overfit the training data because it has sufficient
capacity to do so.

The basic idea to deal with the problem of overfitting is to decrease the complexity of the model. To do so,
we can make the network smaller by simply removing the layers or reducing the number of neurons, etc.

Now, a question comes to mind:

How does Overfitting get reduced when we remove the layers or Reduce the Number of Neurons?

By removing some layers or reducing the number of neurons the network becomes less prone to overfitting as
the neurons contributing to overfitting are removed or deactivated. Therefore, the network has a smaller
number of parameters to learn because of which it cannot memorize all the data points & will be forced to
generalize.
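
The effect on the number of parameters can be checked with a small helper. The function name and the layer sizes are hypothetical; the count assumes fully connected layers with one bias per unit.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


large = count_parameters([100, 64, 64, 10])  # two hidden layers of 64 units
small = count_parameters([100, 32, 10])      # one hidden layer of 32 units
print(large, small)  # 11274 3562
```

Removing one hidden layer and halving the other cuts the number of learnable parameters to roughly a third in this example.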


When using this technique, however, the input and output dimensions of the remaining layers must be
recomputed so that adjacent layers still match.

While implementing this technique, we have to determine:

• How many layers to be removed

• How large your network should be

• How many neurons must be in a layer

There is no rule of thumb for answering the above questions, but there are some popular approaches,
which are described below:

• Grid Search: Apply Grid search Cross-Validation to find out the number of neurons or layers to reduce.

• Trimming: We can also prune our overfitted model by removing nodes or connections until it reaches
suitable performance on unseen datasets after model building.

In simple words, the aim of this technique is to make the neural network smaller to prevent it from
overfitting.

Recurrent networks:

A Recurrent Neural Network (RNN) is a type of neural network in which the output from the previous step is fed
as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each
other, but in cases such as predicting the next word of a sentence, the previous words are
required, and hence there is a need to remember them. RNNs address this
issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden
state, which remembers some information about a sequence.



An RNN has a “memory” which retains information about what has been calculated so far. It uses the same
parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output.
This reduces the number of parameters, unlike other neural networks.

How RNN works

The working of an RNN can be understood with the help of the example below:

Example:

Suppose there is a deeper network with one input layer, three hidden layers and one output layer. As in
other neural networks, each hidden layer has its own set of weights and biases: say (w1, b1) for the first
hidden layer, (w2, b2) for the second hidden layer and (w3, b3) for the third hidden layer.
This means that these layers are independent of each other, i.e. they do not memorize the previous
outputs.



Now the RNN will do the following:

• The RNN converts the independent activations into dependent activations by giving the same weights
and biases to all the layers, which keeps the number of parameters from growing, and it memorizes each
previous output by feeding that output as input to the next hidden layer.

• Hence these three layers can be joined together into a single recurrent layer, in which the weights and
biases of all the hidden layers are the same.
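
A minimal sketch of a single recurrent layer follows, using scalar inputs and a scalar hidden state for brevity. The tanh activation, the parameter names and the example values are assumptions made for this illustration; the point is that the same weights are reused at every step while the hidden state carries information forward.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Apply the same weights (w_x, w_h, b) at every step; h carries the memory."""
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # new state depends on the input and the previous state
        states.append(h)
    return states


# Hypothetical parameters; the same input value is presented twice.
with_memory = rnn_forward([1.0, 1.0], w_x=0.5, w_h=0.8, b=0.0)
without_memory = rnn_forward([1.0, 1.0], w_x=0.5, w_h=0.0, b=0.0)
print(with_memory)     # the two states differ: the second step also sees the first state
print(without_memory)  # the two states are equal: with w_h = 0 nothing is remembered
```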

Support Vector Machines: Maximum margin linear separators. Quadratic programming solution to finding
maximum margin separators. Kernels for learning non-linear functions.

……………………………………………………………………………………………………………………………..

“The support vector machine (SVM) is a supervised learning method that generates input-output mapping
functions from a set of labeled training data.” A Support Vector Machine (SVM) performs classification by
finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the
hyperplane are the support vectors.

Algorithm:

1. Define an optimal hyperplane: maximize margin.

2. Extend the above definition for non-linearly separable problems: have a penalty term for
misclassifications.

3. Map data to high dimensional space where it is easier to classify with linear decision surfaces:
reformulate problem so that data is mapped implicitly to this space.

To define an optimal hyperplane we need to maximize the width of the margin, which equals 2/||w|| for a
hyperplane with weight vector w.

The beauty of SVM is that if the data is linearly separable, there is a unique global minimum value. An ideal
SVM analysis should produce a hyperplane that completely separates the vectors (cases) into two
non-overlapping classes. However, perfect separation may not be possible, or it may result in a model that
fits the training cases so closely that it does not classify new cases correctly. In this situation SVM finds the
hyperplane that maximizes the margin and minimizes the misclassifications.

1. Maximum Margin Linear Separators

For the maximum margin hyperplane only examples on the margin matter (only these affect the distances).
These are called support vectors. The objective of the support vector machine algorithm is to find a hyperplane
in an N-dimensional space (N — the number of features) that distinctly classifies the data points.



To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our
objective is to find a plane that has the maximum margin, i.e the maximum distance between data points of
both classes. Maximizing the margin distance provides some reinforcement so that future data points can be
classified with more confidence.

Hyperplanes

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the
hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the
number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of
input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine
when the number of features exceeds 3.
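
A hyperplane used as a decision boundary can be sketched as a sign test on w·x + b. The weight vector, bias and test points below are hypothetical; with two features the hyperplane is the line x1 - x2 = 0.

```python
def classify(x, w, b):
    """Assign class +1 or -1 depending on which side of the hyperplane w.x + b = 0 the point lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


w, b = [1.0, -1.0], 0.0  # hypothetical hyperplane: the line x1 - x2 = 0
print(classify([2.0, 1.0], w, b))  # 1   (2 - 1 = 1 >= 0)
print(classify([1.0, 3.0], w, b))  # -1  (1 - 3 = -2 < 0)
```

The same function works unchanged for three or more features, where the boundary becomes a plane or a higher-dimensional hyperplane.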

Support Vectors

Support vectors are the data points that lie closest to the hyperplane and influence the position and orientation
of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support
vectors will change the position of the hyperplane. These are the points that help us build our SVM. It is
useful computationally if only a small fraction of the data points are support vectors, because we use the
support vectors to decide which side of the separator a test case is on.



The support vectors are indicated by the circles around them.

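To find the maximum-margin separator, we have to solve the following optimization problem, where x_c is
training case c:

$$ \mathbf{w} \cdot \mathbf{x}_c + b \ge +1 \quad \text{for positive cases} $$

$$ \mathbf{w} \cdot \mathbf{x}_c + b \le -1 \quad \text{for negative cases} $$

and $\lVert \mathbf{w} \rVert^2$ is as small as possible.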
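
The constraints can be checked numerically for a candidate separator. In the sketch below the four points, their labels and the values of w and b are hypothetical, chosen so that the constraints hold with equality for two of the points; the code only verifies a given (w, b) and does not solve the optimization problem itself.

```python
import math


def margin_width(w):
    """Width of the margin, 2 / ||w||, for a separator with weight vector w."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def functional_margins(points, labels, w, b):
    """y * (w.x + b) for every case; the constraints require each value to be at least 1."""
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in zip(points, labels)]


# Hypothetical 2-D data: two positive cases and two negative cases.
points = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
labels = [1, 1, -1, -1]
w, b = [0.5, 0.5], -1.0  # candidate separator

margins = functional_margins(points, labels, w, b)
support = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-9]

print(margins)          # [1.0, 2.0, 1.0, 1.5]: every constraint is satisfied
print(support)          # [0, 2]: the cases lying exactly on the margin
print(margin_width(w))  # about 2.83
```

The cases whose value equals exactly 1 lie on the margin; these are the support vectors, and only they constrain how small ||w|| can be made.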



2. Quadratic programming solution to finding maximum margin separators

