Unit 4 Notes
Chapter-1 Artificial Neural Networks
The term "Artificial Neural Network" is derived from the biological neural networks that make up the structure of the human brain. Just as the human brain has neurons interconnected with one another, artificial neural networks have neurons interconnected with one another in the various layers of the network. These neurons are known as nodes.
The given figure illustrates the typical diagram of Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.
The dendrites of a biological neural network represent the inputs in an artificial neural network, the cell nucleus represents the nodes, the synapses represent the weights, and the axon represents the output.
To understand the architecture of an artificial neural network, we must understand what a neural network consists of: a large number of artificial neurons, termed units, arranged in a sequence of layers. Let us look at the various types of layers available in an artificial neural network.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the programmer.
Hidden Layer:
The hidden layer lies between the input and output layers. It performs all the calculations needed to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally results in output that is
conveyed using this layer.
The artificial neural network takes the input, computes the weighted sum of the inputs, and includes a bias. This computation is represented in the form of a transfer function.
The weighted total is then passed as input to an activation function, which produces the output. Activation functions decide whether a node should fire or not. Only the nodes that fire pass a signal on to the output layer. There are distinctive activation functions available, chosen according to the sort of task we are performing.
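As a minimal sketch of this computation in Python (the input values, weights, and bias are illustrative assumptions, not values from the notes):

```python
import numpy as np

# One artificial node: weighted sum of inputs plus bias (the transfer
# function), followed by a step activation that decides whether to fire.

def step(z):
    # The node "fires" (outputs 1) only when the weighted total exceeds 0.
    return 1 if z > 0 else 0

inputs = np.array([0.5, 0.3, 0.2])    # x1, x2, x3
weights = np.array([0.4, 0.7, 0.2])   # w1, w2, w3
bias = -0.5

z = np.dot(inputs, weights) + bias    # weighted sum: sum(x_i * w_i) + b
print(z, step(z))                     # -0.05 -> 0 (the node does not fire)
```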
Unexplained behavior of the network:
This is the most significant issue of ANNs. When an ANN produces a solution, it does not provide insight into why and how that solution was reached. This decreases trust in the network.
Hardware dependence:
Artificial neural networks need processors with parallel processing power, in line with their structure. The realization of the network therefore depends on suitable hardware.
Types of Artificial Neural Network
A neural network works much as the human nervous system functions. There are several types of neural networks. Their implementations are based on the set of parameters and mathematical operations required for determining the output.
Feedforward Neural Network
The FNN is the purest form of ANN, in which input and data travel in only one direction. Data flows in the forward direction only; that is why it is known as the Feedforward Neural Network.
The data passes through the input nodes and exits from the output nodes. The nodes are not connected cyclically. An FNN does not need to have a hidden layer, nor multiple layers; it may have a single layer.
It has a forward-propagating wave that is achieved by using a classifying activation function. All other types of neural networks use backpropagation, but the FNN does not.
In an FNN, the sum of the products of the inputs and their weights is calculated and then fed to the output. Technologies such as face recognition and computer vision use FNNs.
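A minimal sketch of such a forward-only pass, assuming a small illustrative network with one hidden layer and random weights:

```python
import numpy as np

# Feedforward pass: data flows input -> hidden -> output in one direction
# only, with no cycles. Sizes and weights are illustrative assumptions.

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # 4 input features
W1 = rng.normal(size=(4, 3))      # input -> hidden weights
W2 = rng.normal(size=(3, 1))      # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = sigmoid(x @ W1)               # hidden activations
y = sigmoid(h @ W2)               # network output
print(y)
```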
Radial Basis Function Neural Network
An RBFNN considers the distance of a point from the centre and is designed to work smoothly. There are two layers in the RBF Neural Network. In the inner layer, the features are combined with the radial basis function, and this combined output is used in the final computation. Measures other than the Euclidean distance can also be used.
o We define a receptor t.
o Contour maps are drawn around the receptor.
o For an RBF, Gaussian functions are generally used, so we can define the radial distance r = ||X − t|| (see the sketch below).
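A minimal sketch of a Gaussian radial basis function (the receptor t, the width parameter sigma, and the sample points are illustrative assumptions):

```python
import numpy as np

# Gaussian RBF: the unit's response depends only on the radial distance
# r = ||x - t|| from the input x to the receptor (centre) t.

def rbf_gaussian(x, t, sigma=1.0):
    r = np.linalg.norm(x - t)                  # radial distance to the receptor
    return np.exp(-(r ** 2) / (2 * sigma ** 2))

t = np.array([0.0, 0.0])                       # receptor t
print(rbf_gaussian(np.array([0.0, 0.0]), t))   # 1.0 at the centre
print(rbf_gaussian(np.array([2.0, 0.0]), t))   # response decays with distance
```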
This neural network is used in power restoration systems. In the present era, power systems have increased in size and complexity; both factors increase the risk of major power outages. Power needs to be restored as quickly and reliably as possible after a blackout.
Multilayer Perceptron
A Multilayer Perceptron has three or more layers. It is used to classify data that cannot be separated linearly.
This network is fully connected, meaning every single node is connected to all the nodes in the next layer.
A nonlinear activation function is used in a Multilayer Perceptron. Its input and output layer nodes are connected as a directed graph.
It is a deep learning method, so it uses backpropagation for training the network. It is extensively applied in speech recognition and machine translation technologies.
Convolutional Neural Network
A Convolutional Neural Network plays a vital role in image classification and image recognition; we can say it is the main category of network for those tasks.
Face recognition, object detection, etc., are some areas where CNNs are widely used. A CNN is similar to an FNN: learnable weights and biases are available in the neurons.
A CNN takes an image as input, which is processed and classified under a certain category such as dog, cat, lion, tiger, etc.
As we know, the computer sees an image as pixels, depending on the resolution of the picture. Based on the image resolution, it sees h * w * d, where h = height, w = width, and d = depth (the number of channels). For example, an RGB image is a 6 * 6 * 3 array of the matrix, and a grayscale image is a 4 * 4 * 1 array.
In a CNN, each input image passes through a sequence of convolution layers with filters (also known as kernels), pooling layers, and fully connected layers, and a softmax function is applied to classify the object with probabilistic values between 0 and 1.
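A minimal sketch of the core convolution step only (the image values and the edge-detecting kernel are illustrative assumptions; a real CNN stacks many such filters with pooling and fully connected layers):

```python
import numpy as np

# Slide a 3x3 filter (kernel) over an h x w grayscale image to produce a
# feature map, the basic building block of a convolution layer.

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Elementwise product of the window with the kernel, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)    # a 6 x 6 grayscale image
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)        # vertical-edge filter
print(conv2d(image, kernel).shape)                  # (4, 4) feature map
```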
Recurrent Neural Network
A Recurrent Neural Network is based on prediction. In this neural network, the output of a particular layer is saved and fed back to the input, which helps to predict the outcome of the layer.
In a Recurrent Neural Network, the first layer is formed in the same way as an FNN layer, and the recurrent process begins in the subsequent layers.
Normally, inputs and outputs are independent of each other; but in some cases, such as predicting the next word of a sentence,
the prediction depends on the previous words of the sentence. The RNN is known for its primary and most important feature, the hidden state, which remembers information about a sequence.
An RNN has a memory that stores the results of calculations. It uses the same parameters on each input, performing the same task on all the hidden layers or data to produce the output. Unlike other neural networks, this makes the RNN's parameter complexity lower.
Modular Neural Network
In a Modular Neural Network, several different networks function independently. The task is divided into sub-tasks, each performed by a separate network.
During the computational process, the networks do not communicate directly with each other; they all work independently towards achieving the output.
Combined networks are more powerful than flat, unrestricted ones. An intermediary takes the output of each network and processes them to produce the final output.
A Perceptron, on the other hand, is a single layer of LTUs (Linear Threshold Units). An LTU is similar to an artificial neuron, with the only difference that its inputs and output are not necessarily binary; they can be any number.
As shown in the figure below, the LTU applies a function f(x) to the combination of the inputs and their respective weights.
A Linear Threshold Unit (LTU), as shown above, computes the linear combination of these inputs and weights:

Z = x1*w1 + x2*w2
After this, it applies the function f(x) to Z, which gives the resulting output of the LTU.
But for an LTU to give an output, it needs to know the values of the weights w1 and w2. This is where training comes in: the LTU is trained first to obtain the values of w1 and w2.
A Perceptron is composed of a single layer of LTUs, each of which is connected to every LTU of the previous layer, in other words to the previous Perceptron.
The above combination of neurons receives two inputs and gives one output after the whole computation process.
Perceptrons do not output a class probability; rather, they make predictions based on a hard threshold.
Training a Perceptron
A Perceptron is fed one training instance at a time. For every output neuron that produces a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction:

w(i, j) := w(i, j) + η * (y_j − ŷ_j) * x_i

w(i, j) — connection weight between the ith input neuron and the jth output neuron
η — learning rate
ŷ_j — output of the jth output neuron for the current training instance
y_j — target output of the jth output neuron for the current training instance
x_i — ith input value of the current training instance
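A minimal sketch of this rule in action, assuming the AND function as a toy training set and an illustrative learning rate and epoch count:

```python
import numpy as np

# Train a single-output perceptron with the update rule
# w := w + eta * (target - output) * x on the AND truth table.

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 0, 0, 1])                   # AND outputs

w, b, eta = np.zeros(2), 0.0, 0.1
for epoch in range(10):
    for x, y in zip(X, targets):
        y_hat = 1 if np.dot(w, x) + b > 0 else 0   # hard-threshold output
        w = w + eta * (y - y_hat) * x              # reinforce toward target
        b = b + eta * (y - y_hat)

print([1 if np.dot(w, x) + b > 0 else 0 for x in X])   # [0, 0, 0, 1]
```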
Chapter-2 Perceptrons
1) Introduction on Perceptrons
A Perceptron consists of one or more inputs, a processor, and a single output.
The Perceptron is a Machine Learning algorithm for the supervised learning of various binary classification tasks. A Perceptron can also be understood as an artificial neuron, or a neural network unit, that helps detect certain computations on input data in business intelligence.
The perceptron model is also regarded as one of the best and simplest types of artificial neural network. It is a supervised learning algorithm for binary classifiers; hence, we can consider it a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.
In Machine Learning, binary classifiers are defined as functions that decide whether input data, represented as a vector of numbers, belongs to some specific class. Binary classifiers can be considered linear classifiers: in simple words, classification algorithms that predict using a linear predictor function combining the weights with the feature vector.
Frank Rosenblatt invented the perceptron model as a binary classifier containing three main components. These are as follows:
o Input Nodes or Input Layer:
This is the primary component of Perceptron which accepts the initial data into the system for further
processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units. This is another most important parameter of the Perceptron's components: weight is directly proportional to the strength of the associated input neuron in deciding the output. The bias can be considered as the intercept term in a linear equation.
o Activation Function:
This is the final and important component, which helps determine whether the neuron will fire or not. The activation function can be considered primarily as a step function; common choices are the Sign, Step, and Sigmoid functions.
The data scientist uses the activation function to make decisions based on the problem statement and the desired outputs. The activation function chosen for a perceptron model may differ (e.g., Sign, Step, or Sigmoid) depending on whether the learning process is slow or has vanishing or exploding gradients.
Based on the layers, Perceptron models are divided into two types. These are as follows:
Single Layer Perceptron Model:
This is one of the simplest types of artificial neural network (ANN). A single-layered perceptron model consists of a feed-forward network and includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm does not start from recorded data; it begins with randomly allocated weight parameters. If the outcome matches the pre-determined threshold value, the performance of the model is deemed satisfactory, and the weights are not changed. A single-layer perceptron can learn only linearly separable patterns.
Multi-Layer Perceptron Model:
Like a single-layer perceptron model, a multi-layer perceptron model has the same structure but a greater number of hidden layers.
The multi-layer perceptron model is also known as the backpropagation algorithm, which executes in two stages, as follows:
o Forward Stage: Activations start from the input layer in the forward stage and terminate at the output layer.
o Backward Stage: In the backward stage, the weight and bias values are modified as per the model's requirement. The error between the actual and the desired output is propagated backward, starting at the output layer and ending at the input layer.
A multi-layer perceptron model has greater processing power and can process both linear and non-linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, and NOR.
Perceptron Function
The perceptron function f(x) is achieved as output by multiplying the input x with the learned weight coefficient w and adding the bias b:

f(x) = 1 if w · x + b > 0
f(x) = 0 otherwise
Characteristics of Perceptron
1. Initially, the weights are multiplied with the input features, and a decision is made whether the neuron is fired or not.
2. The activation function applies a step rule to check whether the weighted sum is greater than zero.
3. A linear decision boundary is drawn, enabling the distinction between the two linearly separable classes +1 and -1.
4. If the added sum of all input values is more than the threshold value, there is an output signal; otherwise, no output is shown.
o The output of a perceptron can only be a binary number (0 or 1), due to the hard-limit transfer function.
o A perceptron can only be used to classify linearly separable sets of input vectors. If the input vectors are not linearly separable, it is not easy to classify them properly.
Gradient Descent
The cost function (C), or loss function, measures the difference between the actual output and the predicted output. We randomly initialize all the weights of the neural network to values close to zero, but not zero, and then calculate the gradient ∂C/∂w, the partial derivative of the cost with respect to each weight, to update the weights:

w := w − α · (∂C/∂w)

where w is the weight of a neuron, α is the learning rate, which helps adjust the weights with respect to the gradient, C is the cost, and ∂C/∂w is the gradient.
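A minimal sketch of this update rule on a toy one-weight cost (the quadratic cost, starting value, learning rate, and iteration count are illustrative assumptions):

```python
# Gradient descent on C(w) = (w - 3)^2, whose gradient is dC/dw = 2*(w - 3).

w = 0.1           # weight initialized close to zero, but not zero
alpha = 0.1       # learning rate

for step in range(50):
    grad = 2 * (w - 3)        # partial derivative of the cost w.r.t. w
    w = w - alpha * grad      # move against the gradient

print(w)   # approaches 3.0, the minimizer of the cost
```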
Learning Rate
The learning rate controls how much we adjust the weights with respect to the loss gradient. The lower the value of the learning rate, the slower the convergence to the global minimum; too high a value will not allow gradient descent to converge at all.
Since our goal is to minimize the cost function to find the optimized values for the weights, we run multiple iterations with different weights and calculate the cost at each step to arrive at the minimum cost. This search is computationally intensive.
Batch Gradient Descent
Batch gradient descent uses the entire dataset to calculate each iteration of gradient descent.
o Theoretical analysis of the weights and convergence rates is easy to understand.
o It performs redundant computations for the same training examples in large datasets.
o It can be very slow and intractable, as large datasets may not fit in memory.
o Since the entire dataset is used for each computation, the weights cannot be conveniently updated for new data on the fly.
Stochastic Gradient Descent
In stochastic gradient descent we use a single data point, or example, to calculate the gradient and update the weights at every iteration. We first need to shuffle the dataset so that we get completely randomized examples. As the dataset is randomized and the weights are updated for each single example, the updates of the weights and the cost function will be noisy.
o The random samples help the search arrive at a global minimum and avoid getting stuck at local minima.
o Learning is much faster, and convergence is quick, for very large datasets.
o As we update the weights frequently, the cost function fluctuates heavily.
Mini-Batch Gradient Descent
Mini-batch gradient descent is widely used; it converges faster and is more stable than the other two variants. As we take a batch of different samples for each update, it reduces the noise, i.e. the variance of the weight updates.
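A minimal sketch contrasting the three variants on a toy linear model y = w*x (the dataset, learning rates, batch size, and epoch count are illustrative assumptions):

```python
import numpy as np

# Fit y = w*x by squared error; only the amount of data per update differs.
rng = np.random.default_rng(1)
X = rng.normal(size=200)
y = 4.0 * X + rng.normal(scale=0.1, size=200)   # true weight is 4.0

def grad(w, xb, yb):
    return np.mean(2 * (w * xb - yb) * xb)      # d/dw of mean squared error

w_batch = w_sgd = w_mini = 0.0
for epoch in range(50):
    idx = rng.permutation(len(X))               # shuffle each epoch
    w_batch -= 0.1 * grad(w_batch, X, y)        # batch: whole dataset per step
    for i in idx:                               # stochastic: one example
        w_sgd -= 0.005 * grad(w_sgd, X[i:i + 1], y[i:i + 1])
    for s in range(0, len(X), 32):              # mini-batch: 32 examples
        b = idx[s:s + 32]
        w_mini -= 0.02 * grad(w_mini, X[b], y[b])

print(w_batch, w_sgd, w_mini)   # each estimate approaches the true weight 4.0
```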
2) Multilayer networks and Back propagation
Multi-Layer Perceptron
A Neural Network model which has more than one layer of perceptrons is known as a Multi-Layer Perceptron.
It comprises an input layer, one or more layers of LTUs, and one output layer.
The layers other than the input and output layers are known as hidden layers. When there are two or more hidden layers, the Neural Network is known as a Deep Neural Network.
Each layer of the Neural Network except the output layer includes a neuron that always outputs 1. This neuron is known as the Bias Neuron.
Back propagation
What is Back propagation?
Back propagation is the essence of neural network training. It is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights reduces error rates and makes the model more reliable by increasing its generalization.
Backpropagation is short for "backward propagation of errors." It is a standard method of training artificial neural networks, and it helps calculate the gradient of the loss function with respect to all the weights in the network.
How the Backpropagation Algorithm Works
The backpropagation algorithm computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.
Consider a simple backpropagation neural network example: after the forward pass produces an output and the error is measured, we travel back from the output layer to the hidden layers, adjusting the weights such that the error is decreased.
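A minimal sketch of these steps on the XOR problem (the one-hidden-layer architecture, squared-error loss, and hyperparameters are illustrative assumptions, not the example diagram from the notes):

```python
import numpy as np

# Forward pass, then propagate the error backward with the chain rule,
# one layer at a time, and take a gradient step on every weight and bias.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))
eta = 1.0

for epoch in range(4000):
    h = sigmoid(X @ W1 + b1)                 # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)      # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)       # error propagated to hidden layer
    W2 -= eta * h.T @ d_out                  # weight updates, layer by layer
    b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_h
    b1 -= eta * d_h.sum(axis=0)

print(out.round(2))   # typically converges close to the XOR targets 0,1,1,0
```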
A feedforward neural network is an artificial neural network where the nodes never form a cycle. This kind of
neural network has an input layer, hidden layers, and an output layer. It is the first and simplest type of artificial
neural network.
There are two types of backpropagation networks:
o Static back-propagation
o Recurrent backpropagation
Static back-propagation:
This kind of backpropagation network produces a mapping of a static input to a static output. It is useful for solving static classification problems such as optical character recognition.
Recurrent backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.
Disadvantages of using Backpropagation
o The actual performance of backpropagation on a specific problem depends on the input data.
o The backpropagation algorithm can be quite sensitive to noisy data.
o You need to use the matrix-based approach for backpropagation instead of the mini-batch approach.
3) Number of Hidden Layers and Neurons
Introduction
ANN is inspired by the biological neural network. For simplicity, in computer science, it is represented as a set
of layers. These layers are categorized into three classes which are input, hidden, and output.
Knowing the number of input and output layers and the number of their neurons is the easiest part. Every
network has a single input layer and a single output layer. The number of neurons in the input layer equals the
number of input variables in the data being processed. The number of neurons in the output layer equals the
number of outputs associated with each input. But the challenge is knowing the number of hidden layers and
their neurons.
Here are some guidelines to know the number of hidden layers and neurons per each hidden layer in a
classification problem:
1. Based on the data, draw an expected decision boundary to separate the classes.
2. Express the decision boundary as a set of lines. Note that the combination of such lines must yield the decision boundary.
3. The number of selected lines represents the number of hidden neurons in the first hidden layer.
4. To connect the lines created by the previous layer, a new hidden layer is added. Note that a new hidden
layer is added each time you need to create connections among the lines in the previous hidden layer.
5. The number of hidden neurons in each new hidden layer equals the number of connections to be made.
To make things clearer, let’s apply the previous guidelines for a number of examples.
Example 1
Let's start with a simple example of a classification problem with two classes, as shown in figure 1. Each sample has two inputs and one output that represents the class label. It is quite similar to the XOR problem.
Figure 1
The first question to answer is whether hidden layers are required or not. A rule to follow in order to determine
whether hidden layers are required or not is as follows:
In artificial neural networks, hidden layers are required if and only if the data must be separated non-
linearly.
Looking at figure 2, it seems that the classes must be separated non-linearly: a single line will not work. As a result, we must use hidden layers in order to get the best decision boundary. We could still choose not to use hidden layers, but that would limit the classification accuracy, so it is better to use them.
In order to add hidden layers, we need to answer the following two questions:
1. What is the required number of hidden layers?
2. What is the number of hidden neurons in each hidden layer?
Following the previous procedure, the first step is to draw the decision boundary that splits the two classes.
There is more than one possible decision boundary that splits the data correctly as shown in figure 2. The one we
will use for further discussion is in figure 2(a).
Figure 2
Following the guidelines, the next step is to express the decision boundary by a set of lines.
The idea of representing the decision boundary using a set of lines comes from the fact that any ANN is built
using the single layer perceptron as a building block. The single layer perceptron is a linear classifier which
separates the classes using a line created according to the following equation:

y = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b

where x_i is the input, w_i is its weight, b is the bias, and y is the output. Because each hidden neuron added increases the number of weights, it is recommended to use the least number of hidden neurons that accomplish the task. Using more hidden neurons than required adds more complexity.
4) Distributed representations
The concept of distributed representations is often central to deep learning, particularly as it applies
to natural language tasks.
Those beginning in the field may quickly understand this as simply a vector that represents some
piece of data. While this is true, understanding distributed representations at a more conceptual level
increases our appreciation of the role they play in making deep learning so effective.
To examine different types of representation, we can do a simple thought exercise. Let’s say we
have a bunch of “memory units” to store information about shapes. We can choose to represent each
individual shape with a single memory unit, as demonstrated in Figure 1.
Figure 2 shows a distributed representation of this same set of shapes, where information about each shape is represented with multiple "memory units" for concepts related to orientation and form. Now the "memory units" contain information both about an individual shape and about how the shapes relate to each other.
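A minimal sketch of the two schemes (the shapes and the chosen feature units are illustrative assumptions):

```python
import numpy as np

# Local representation: one "memory unit" per shape (one-hot), so no unit
# is shared and no similarity between shapes is visible.
local = np.eye(4)

# Distributed representation: units = [vertical, horizontal, bar, ellipse],
# so related shapes share active units.
distributed = np.array([
    [1, 0, 1, 0],   # vertical bar
    [0, 1, 1, 0],   # horizontal bar
    [1, 0, 0, 1],   # vertical ellipse
    [0, 1, 0, 1],   # horizontal ellipse
])

# Overlap (dot product) between "vertical bar" and "vertical ellipse":
print(local[0] @ local[2])              # 0.0 -> no notion of relatedness
print(distributed[0] @ distributed[2])  # 1 -> they share the "vertical" unit
```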
5) Overfitting
What is Overfitting?
When a model performs very well for training data but has poor performance with test data (new data), it is
known as overfitting. In this case, the machine learning model learns the details and noise in the training data
such that it negatively affects the performance of the model on test data. Overfitting can happen due to low bias
and high variance.
Reasons for Underfitting
Data used for training is not cleaned and contains noise (garbage values) in it
The model has a high bias
The size of the training dataset used is not enough
The model is too simple
Ways to Tackle Underfitting
Increase the number of features in the dataset
Increase model complexity
Reduce noise in the data
Increase the duration of training the data
Now that you have understood what overfitting and underfitting are, let's see what makes a good-fit model.
6) Artificial Neural Networks
An ANN is intended to simulate the behavior of biological systems composed of "neurons". ANNs are computational models inspired by an animal's central nervous system, capable of machine learning and pattern recognition. They are usually presented as systems of interconnected "neurons" which can compute values from inputs.
An ANN consists of nodes, which in the biological analogy represent neurons, connected by arcs that correspond to dendrites and synapses. Each arc is associated with a weight; at each node, an activation function is applied to the values received as input along the incoming arcs.
Structure of a Biological Neural Network
A neural network is a machine learning algorithm based on the model of a human neuron. The human brain consists of millions of neurons, which send and process signals in the form of electrical and chemical signals. These neurons are connected to one another by special structures known as synapses, which allow them to pass signals.
An ANN works the way the human brain processes information. It includes a large number of connected processing units that work together to process information and generate meaningful results.
We can apply neural networks not only to classification but also to the regression of continuous target variables.
Neural networks find great application in data mining across many sectors, for example economics, forensics, etc., and in pattern recognition. They can also be used for data classification over large amounts of data after careful training.
Artificial Neural Network Layers
a. Input layer
The purpose of the input layer is to receive as input the values of the explanatory attributes for each observation. Usually, the number of input nodes in an input layer is equal to the number of explanatory variables. The input layer presents the patterns to the network, which communicates them to one or more hidden layers.
The nodes of the input layer are passive, meaning they do not change the data. They receive a single value on their input and duplicate the value to their many outputs, passing a copy of each value on to the hidden nodes.
b. Hidden Layer
The hidden layers apply given transformations to the input values inside the network. Each hidden node has incoming arcs from input nodes or other hidden nodes, and outgoing arcs to output nodes or other hidden nodes. The actual processing is done in the hidden layers. There may be one or more hidden layers. The values entering a hidden node are multiplied by weights, a set of predetermined numbers stored in the program. The weighted inputs are then added to produce a single number.
c. Output layer
The hidden layers then link to an output layer. The output layer receives connections from the hidden layers or from the input layer. It returns an output value that corresponds to the prediction of the response variable. In classification problems, there is usually only one output node. The active nodes of the output layer combine and change the data to produce the output values.
The choice of the structure determines the results which are going to be obtained; it is the most critical part of building a neural network.
The simplest structure is the one in which the units are distributed in two layers: an input layer and an output layer. Each unit in the input layer has a single input and a single output, which is equal to the input. The output unit has all the units of the input layer connected to its input, with a weight associated with each connection.
There may be more than one output unit. In this case, the resulting model is a linear or logistic regression, depending on whether the transfer function is linear or logistic. The weights of the network are the regression coefficients.
By adding one or more hidden layers between the input and output layers, with units in these layers, the modeling power of the network increases. But the number of hidden layers should be as small as possible. This ensures that the neural network does not merely store all the information from the learning set but can generalize from it, avoiding overfitting.
Overfitting can occur when the weights make the system learn the details of the learning set instead of discovering its structures. This happens when the size of the learning set is too small in relation to the complexity of the model.
Whether or not a hidden layer is present, the output layer of the network can sometimes have many units, for example when the response variable has many classes.
7) Recurrent networks
A Recurrent Neural Network (RNN) is a type of Neural Network where the output from the previous step is fed as input to the current step.
In traditional neural networks, all the inputs and outputs are independent of each other; but in cases such as predicting the next word of a sentence, the previous words are required, and hence there is a need to remember them.
Thus the RNN came into existence, solving this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence.
An RNN has a "memory" which remembers all the information about what has been calculated. It uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the complexity of the parameters, unlike other neural networks.
The working of an RNN can be understood with the help of the example below.
Example:
Suppose there is a deeper network with one input layer, three hidden layers, and one output layer. Like other neural networks, each hidden layer would then have its own set of weights and biases; say the weights and biases for hidden layer 1 are (w1, b1), (w2, b2) for the second hidden layer, and (w3, b3) for the third. This would mean that each of these layers is independent of the others, i.e. they do not memorize the previous outputs.
The RNN instead converts these independent activations into dependent ones by providing the same weights and biases to all the layers, thus reducing the complexity of the parameters and memorizing each previous output by giving it as input to the next hidden layer.
Hence these three layers can be joined together into a single recurrent layer, such that the weights and biases of all the hidden layers are the same.
Formula for calculating the current state:

h_t = f(h_{t-1}, x_t)

where h_t is the current state, h_{t-1} is the previous state, and x_t is the input state. With a tanh activation function this becomes:

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t)

where W_hh is the weight at the recurrent neuron and W_xh is the weight at the input neuron.

Formula for calculating the output:

y_t = W_hy · h_t

where:
y_t -> output
W_hy -> weight at output layer
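A minimal sketch of this recurrence (all dimensions, the random weights, and the input sequence are illustrative assumptions):

```python
import numpy as np

# The same three weight matrices are reused at every time step; the hidden
# state h carries information forward through the sequence.

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 5))        # input  -> hidden
W_hh = rng.normal(size=(5, 5))        # hidden -> hidden (the recurrence)
W_hy = rng.normal(size=(5, 2))        # hidden -> output

h = np.zeros(5)                       # initial hidden state
sequence = rng.normal(size=(4, 3))    # 4 time steps, 3 features each

for x_t in sequence:
    h = np.tanh(x_t @ W_xh + h @ W_hh)    # h_t = tanh(W_xh·x_t + W_hh·h_{t-1})
    y_t = h @ W_hy                        # y_t = W_hy·h_t
    print(y_t)
```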
Advantages of Recurrent Neural Network
1. An RNN remembers every piece of information through time. This is useful in time series prediction because of its ability to remember previous inputs as well. This is called Long Short Term Memory.
2. Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood.
Disadvantages of Recurrent Neural Network
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences when using tanh or ReLU as the activation function.
Support Vector Machines
"The support vector machine (SVM) is a supervised learning method that generates input-output mapping functions from a set of labeled training data." A Support Vector Machine performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the support vectors.
Algorithm:
For the maximum margin hyperplane only examples on the margin matter (only these affect the distances).
These are called support vectors. The objective of the support vector machine algorithm is to find a hyperplane
in an N-dimensional space (N — the number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e., the maximum distance between the data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
Hyperplanes
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the
hyperplane can be attributed to different classes.
Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is
2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-
dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
Support Vectors
Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the
hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors
will change the position of the hyperplane.
These are the points that help us build our SVM. It will be useful computationally if only a small fraction of the
datapoints are support vectors, because we use the support vectors to decide which side of the separator a test
case is on.
The support vectors are indicated by the circles around them.
To find the maximum-margin separator, we have to solve the following optimization problem:

minimize ||w||^2 / 2
subject to y_i (w · x_i + b) ≥ 1 for every training point (x_i, y_i)
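A minimal sketch of a maximum-margin classifier in practice, using scikit-learn (the toy data points are illustrative assumptions; a large C approximates the hard-margin problem above):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],      # class -1
              [4, 4], [5, 4], [4, 5]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)          # hard-margin-like linear SVM
clf.fit(X, y)

print(clf.support_vectors_)                # the points that define the margin
print(clf.coef_, clf.intercept_)           # w and b of the separating hyperplane
print(clf.predict([[0, 0], [5, 5]]))       # -> [-1  1]
```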
Linear models are nice and interpretable but have limitations: they can't learn "difficult" nonlinear patterns.
How Does a Kernel Work?
Kernel machines are a class of pattern-analysis algorithms, the most well-known member of which is
the support vector machine (SVM).
The general objective of pattern analysis is to discover and investigate various sorts of relationships (for example clusters, rankings, principal components, correlations, and classifications) in datasets.
Kernel methods are approaches for dealing with linearly inseparable data or non-linear data sets like
those presented in fig-1. The concept is to use a mapping function to project nonlinear combinations
of the original features onto a higher-dimensional space, where the data becomes linearly separable.
The two-dimensional dataset (X1, X2) is projected into a new three-dimensional feature space (Z1,
Z2, Z3) in the diagram above, where the classes become separable.
It appears that we will have to operate on the higher dimensional vectors in the modified feature
space in order to train a support vector classifier and maximize our objective function.
In real-world applications, data may contain numerous features, and transformations using multiple
polynomial combinations of these features will result in extremely large and prohibitive processing
costs.
Types of Kernel Functions
The kernel function is a function that may be expressed as the dot product of the mapping function φ (the kernel method) and looks like this:

K(x_i, x_j) = φ(x_i) · φ(x_j)

The kernel function simplifies the process of determining the mapping function. As a result, the kernel function defines the inner product in the transformed space.
Polynomial Kernel
The polynomial kernel is a kernel function that allows the learning of non-linear models by
representing the similarity of vectors (training samples) in a feature space over polynomials of the
original variables. It is often used with support vector machines (SVMs) and other kernelized
models.
F(x, x_j) = (x · x_j + 1)^d
Sigmoid Kernel
It is primarily used in neural networks. This kernel function is similar to the activation function for
neurons in a two-layer perceptron model of a neural network.
F(x, x_j) = tanh(α (x · x_j) + c)
Linear Kernel
It is the most fundamental sort of kernel and is usually one-dimensional in structure. When there are
numerous characteristics, it proves to be the best function. The linear kernel is commonly used for
text classification issues since most of these problems can be linearly split. Other functions are
slower than linear kernel functions.
F(x, x_j) = x · x_j (i.e., the sum of the products of the corresponding components)
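The three kernels above can be written directly as functions; a minimal sketch (the parameter values d, α, and c are illustrative assumptions):

```python
import numpy as np

def linear_kernel(x, xj):
    return np.dot(x, xj)                       # F(x, xj) = x · xj

def polynomial_kernel(x, xj, d=2):
    return (np.dot(x, xj) + 1) ** d            # F(x, xj) = (x · xj + 1)^d

def sigmoid_kernel(x, xj, alpha=0.1, c=0.0):
    return np.tanh(alpha * np.dot(x, xj) + c)  # F(x, xj) = tanh(α(x · xj) + c)

x = np.array([1.0, 2.0])
xj = np.array([0.5, -1.0])
print(linear_kernel(x, xj), polynomial_kernel(x, xj), sigmoid_kernel(x, xj))
```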
ADDITIONAL MATERIAL
a. Aerospace
Generally, we use ANNs for aircraft autopilots. They are also used for aircraft fault detection.
b. Military
We use ANNs in the military in various ways, such as weapon orientation and steering, and target tracking.
c. Electronics
Basically, we use artificial neural networks in electronics in many ways, such as code sequence prediction, IC chip layout, and chip failure analysis.
d. Medical
Medicine has many machines that use ANNs in various ways, such as cancer cell analysis and EEG and ECG analysis.
e. Speech
ANNs are applied to speech tasks such as speech recognition and speech classification.
f. Telecommunications
Generally, telecommunications has different applications for ANNs, such as image and data compression and automated information services.
g. Transportation
Generally, we use artificial neural networks in transportation in many ways, such as truck brake system diagnosis, vehicle scheduling, and routing systems.
h. Software
Software also uses ANNs in pattern recognition, such as facial recognition, optical character recognition, etc.
We also use artificial neural networks for time series prediction, such as making predictions on stocks and natural calamities.