Module1 - Upto Loss Function
to Deep Learning
Module 1
• Fundamentals of Deep Learning
• Perceptron
• Multilayer Perceptron
• Activation Functions
• Loss Functions
• Optimization techniques - Gradient Descent
• Feedforward Neural Network
• Training Neural Networks - Backpropagation
• Hyper parameters
• Under fitting
• Overfitting
• Regularization
• Dropouts
• Batch Normalization
Fundamentals of Deep Learning
Deep Learning (DL)
• Deep learning is a way of classifying, clustering, and
predicting things by using a neural network that has been
trained on vast amounts of data.
Applications of DL
Deep Learning (DL)
• DL has its roots in neural networks (NN)
• NN are a set of complex algorithms that are
designed for pattern recognition.
• These NNs are modeled after the human brain and its
biological neurons.
• A human brain has roughly 86 billion neurons
connected to many other neurons.
• The fundamental unit of a NN is a node, based on
the biological neuron of a human brain.
Deep NN
• These are NNs with more than two layers.
• 'Deep' refers to the number of hidden layers.
Some DL Architectures
Designing a NN
• Movement of information in a NN happens in two stages:
(feed)forward propagation and backpropagation.
Perceptron
A single-layer perceptron is the basic unit of a neural network. A perceptron consists
of input values, weights and a bias, a weighted sum, and an activation function.
• A perceptron works by taking in some numerical inputs along with what are known
as weights and a bias.
• It then multiplies each input by its respective weight.
• These products are then added together along with the bias (this is known as the
weighted sum).
• The activation function takes the weighted sum, including the bias, as its input and
returns a final output.
Assume we have a single neuron and three inputs x1, x2, x3 multiplied by the
weights w1, w2, w3 respectively.
Given the numerical values of the inputs and the weights, there is a function inside
the neuron that produces an output.
An activation function is a function that converts the input given (in this case,
the weighted sum) into a certain output based on a set of rules.
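The perceptron described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the step activation, the input values, the weights, and the bias are all assumptions chosen for the example.

```python
def step(z):
    """Threshold activation (assumed for this sketch): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias, activation=step):
    # Weighted sum: multiply each input by its weight, add the products and the bias.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The activation function turns the weighted sum into the final output.
    return activation(z)


x = [1.0, 0.5, -1.0]   # x1, x2, x3 (illustrative values)
w = [0.4, -0.2, 0.1]   # w1, w2, w3 (illustrative values)
b = -0.1               # bias (illustrative value)

output = perceptron(x, w, b)   # weighted sum is about 0.1, so the neuron fires
print(output)
```

Swapping `step` for another function changes only the last stage; the weighted sum stays the same.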
Multi-Layer Perceptron (MLP)
Build a network with 2 input neurons, 3 hidden neurons, 2 output neurons, and 4 observations in training
set.
Use same number of layers and neurons but reduce the number of observations in dataset to 1 instance:
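A rough sketch of the forward pass for the 2-3-2 network in both settings, using NumPy. The random weights, zero biases, and sigmoid activation are assumptions made for illustration; the point is only how the array shapes change with the number of observations.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, W1, b1, W2, b2):
    hidden = sigmoid(X @ W1 + b1)        # shape: (observations, 3)
    output = sigmoid(hidden @ W2 + b2)   # shape: (observations, 2)
    return hidden, output


# 2 input neurons -> 3 hidden neurons -> 2 output neurons
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

X_four = rng.normal(size=(4, 2))   # 4 observations, 2 features each
X_one = X_four[:1]                 # the same network with a single observation

hidden4, out4 = forward(X_four, W1, b1, W2, b2)
hidden1, out1 = forward(X_one, W1, b1, W2, b2)

print(hidden4.shape, out4.shape)   # (4, 3) (4, 2)
print(hidden1.shape, out1.shape)   # (1, 3) (1, 2)
```

The weight matrices do not depend on the number of observations; only the leading dimension of the activations does.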
Activation Functions
What is an Activation Function?
• The output of a linear activation function is not confined to any range.
Disadvantages of Linear Activation Function
Nonlinear activation functions are the most widely used activation functions.
They are mainly divided on the basis of their range or curves.
Advantages of Non-Linear Activation Functions
The main reason we use the sigmoid function is that its output lies between 0 and 1.
Therefore, it is especially used for models where we have to predict a probability as
the output. Since the probability of anything lies only in the range 0 to 1, sigmoid is
the right choice.
• The smaller the input number (more negative), the closer the output is to 0.
• The greater the input number (more positive), the closer the output is to 1.
• Adds non-linearity.
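As a small illustration of this behaviour, the sigmoid can be written with the standard library alone. The sample inputs are arbitrary.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    # Very negative inputs give outputs near 0, very positive inputs near 1.
    print(z, round(sigmoid(z), 5))
```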
Disadvantages of Sigmoid Activation Function
• The output range of the tanh function is (-1, 1). tanh is also sigmoidal (S-shaped).
• Tanh is zero centered.
• Negative inputs are mapped strongly negative.
• Positive inputs are mapped strongly positive.
• Zero inputs are mapped near zero.
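These properties are easy to check numerically. The sketch below uses `math.tanh` from the standard library; the sample inputs are arbitrary.

```python
import math

for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    # tanh is zero centered: tanh(0) = 0 and tanh(-z) = -tanh(z).
    print(z, round(math.tanh(z), 4))
```

Because the outputs are centered on zero, the activations passed to the next layer can be positive or negative, unlike sigmoid outputs, which are always positive.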
• ReLU is the most widely used activation function, since it is used in almost all
convolutional neural networks and deep learning models.
• ReLU is half rectified (from the bottom): R(z) is zero when z is less than
zero, and R(z) is equal to z when z is greater than or equal to zero.
• Range: [0, ∞)
• Any negative input given to the ReLU activation function is immediately turned
into zero, so negative values are not mapped appropriately.
Disadvantages of ReLU :
• Sigmoid and tanh functions are sometimes avoided due to the vanishing
gradient problem.
• If we encounter dead neurons in our network, the leaky ReLU
function is a better choice.
• Always keep in mind that the ReLU function should only be used in the
hidden layers. At present, ReLU works well most of the time as a
general-purpose default.
• Variants of ReLU
• Leaky ReLU
• Parametric ReLU
• Exponential Linear Unit
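A minimal sketch of ReLU and the variants listed above. The slope `alpha=0.01` for leaky ReLU and `alpha=1.0` for ELU are commonly used defaults assumed here, not values taken from these slides. Parametric ReLU has the same form as leaky ReLU, except that `alpha` is learned during training instead of being fixed.

```python
import math


def relu(z):
    # Zero for negative inputs, identity otherwise.
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    # Small non-zero slope for negative inputs keeps the gradient alive.
    # Parametric ReLU uses the same formula with a learned alpha.
    return z if z >= 0 else alpha * z


def elu(z, alpha=1.0):
    # Exponential Linear Unit: smooth curve that saturates at -alpha.
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)


for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), leaky_relu(z), round(elu(z), 4))
```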
SoftMax Activation Function
• When constructing Artificial Neural Network (ANN) models, one of the key
considerations is to select activation functions for the hidden and output
layers that are differentiable, i.e., their derivatives should exist and should not
be zero everywhere, so that gradients can flow during backpropagation.
1 Sigmoid
2 Softmax
3 ReLU
4 Leaky ReLU
6 Tanh
Tip 1:
Generally, we use softmax activation instead of sigmoid with the cross-entropy loss, because softmax
distributes the probability across all output nodes so that the outputs sum to 1.
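A short sketch of softmax in plain Python. Subtracting the maximum score before exponentiating is a standard numerical-stability step added here; the example scores are arbitrary.

```python
import math


def softmax(scores):
    # Shift by the maximum so that exp() never sees a large positive number.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    # Each output is non-negative and all outputs sum to 1.
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], round(sum(probs), 6))
```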
4 Hinge Loss / Multi-class SVM Loss
5 Cross-Entropy Loss / Negative Log Likelihood (Classification)
6 Huber Loss
P : Actual Probability
Q : Predicted Probability
Entropy :
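With P as the actual distribution and Q as the predicted one, entropy and cross-entropy can be sketched as below. The formulas H(P) = −Σ p·log(p) and H(P, Q) = −Σ p·log(q) are the standard definitions, supplied here because the slide's own equation is missing; the natural logarithm and the example distributions are assumptions.

```python
import math


def entropy(p):
    # H(P) = -sum(p * log(p)); terms with p = 0 contribute nothing.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    # H(P, Q) = -sum(p * log(q)); P is the actual, Q the predicted distribution.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


P = [1.0, 0.0, 0.0]   # actual class as a one-hot distribution (illustrative)
Q = [0.7, 0.2, 0.1]   # predicted probabilities (illustrative)

print(entropy(P), round(cross_entropy(P, Q), 4))
```

For a one-hot P, the cross-entropy reduces to the negative log of the probability assigned to the true class, which is why it is also called the negative log likelihood.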
Loss Functions
BACK - PROPAGATION
10/17/2022
C = Loss = Mean Squared Error()
Optimization
Given a function f(x), an optimization algorithm helps in either minimizing or
maximizing the value of f(x).
In deep learning, optimization algorithms are used to train the neural network by
optimizing the cost function J. The cost function is defined as:
J(W, b) = (1/m) · Σ L(y’(i), y(i)), summed over the m training examples
• The value of the cost function J is the mean of the loss L between the predicted value
y’ and the actual value y.
• The value y’ is obtained during the forward propagation step and makes use of the
weights W and biases b of the network.
• With the help of optimization algorithms, we minimize the value of the cost function J
by updating the values of the trainable parameters W and b.
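As a small sketch, the cost can be computed as the mean of a per-example loss. Squared error is assumed as the loss L here, and the predictions and targets are made-up numbers.

```python
def squared_error(y_pred, y_true):
    # Per-example loss L (squared error is an assumption for this sketch).
    return (y_pred - y_true) ** 2


def cost(y_preds, y_trues, loss=squared_error):
    # J = mean of the loss over all m training examples.
    m = len(y_trues)
    return sum(loss(yp, yt) for yp, yt in zip(y_preds, y_trues)) / m


y_pred = [2.0, 4.0, 5.0]   # predictions from a forward pass (illustrative)
y_true = [1.0, 4.0, 7.0]   # actual values (illustrative)

J = cost(y_pred, y_true)   # (1 + 0 + 4) / 3
print(round(J, 4))
```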
Gradient Descent
Batch Gradient Descent
• Batch Gradient Descent involves calculations
over the full training set at each step, as a result
of which it is very slow on very large training
sets.
• Thus, Batch GD becomes computationally very
expensive.
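A minimal sketch of batch gradient descent, assuming a linear model with a mean-squared-error cost. The synthetic data, learning rate, and step count are illustrative choices, not values from the slides.

```python
import numpy as np


def batch_gradient_descent(X, y, lr=0.1, steps=500):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        error = X @ w + b - y                # every step uses all m examples
        grad_w = (2.0 / m) * (X.T @ error)   # gradient of the MSE w.r.t. w
        grad_b = (2.0 / m) * error.sum()     # gradient of the MSE w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0                      # noise-free line: slope 3, intercept 1

w, b = batch_gradient_descent(X, y)
print(np.round(w, 3), round(b, 3))
```

Each step costs one full pass over the data, which is where the expense on large training sets comes from.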
• In Stochastic Gradient Descent (SGD), we consider just one example at a
time to take a single step. We do the following steps in one epoch for SGD:
1. Take an example
2. Feed it to the neural network
3. Calculate its gradient
4. Use the gradient calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the examples in the training dataset
• Drawback:
• SGD takes more iterations than GD to reach the minimum, and
its updates are noisier than those of Gradient Descent.
• As SGD computes derivatives of only one point at a time, the time taken to
complete one epoch is large compared to the Gradient Descent algorithm.
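The SGD steps listed above can be sketched as follows, again assuming a linear model with squared-error loss. The data, learning rate, and epoch count are illustrative.

```python
import numpy as np


def sgd(X, y, lr=0.05, epochs=50):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in range(m):                   # one example at a time
            error = X[i] @ w + b - y[i]      # feed the example forward
            grad_w = 2.0 * error * X[i]      # gradient for this single example
            grad_b = 2.0 * error
            w -= lr * grad_w                 # update immediately
            b -= lr * grad_b
    return w, b


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 1.0                      # noise-free line: slope 3, intercept 1

w, b = sgd(X, y)
print(np.round(w, 3), round(b, 3))
```

One epoch here performs 100 separate weight updates, compared with a single update per pass in batch gradient descent.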
Mini Batch Stochastic Gradient Descent
• MB-SGD is an extension of the SGD algorithm.
• It is also common to sample a small number of data points instead of just
one point at each step; this is called “mini-batch” gradient descent.
Mini-batch tries to strike a balance between the goodness of gradient descent and
the speed of SGD.
• It overcomes the time-consuming complexity of SGD by taking a batch of
points (a subset of the dataset) to compute the derivative.
• After creating the mini-batches of fixed size, we do the following steps in one
epoch:
1. Pick a mini-batch
2. Feed it to the neural network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the mini-batches we created
• Drawback: the weight updates are noisier than in batch gradient descent, because
the derivative computed on a mini-batch does not always point towards the minimum.
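A sketch of the mini-batch steps under the same assumptions as before (linear model, squared-error loss). The batch size of 16, learning rate, and epoch count are illustrative.

```python
import numpy as np


def minibatch_sgd(X, y, batch_size=16, lr=0.1, epochs=100):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        for start in range(0, m, batch_size):           # pick a mini-batch
            Xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            error = Xb @ w + b - yb                     # feed it forward
            grad_w = (2.0 / len(yb)) * (Xb.T @ error)   # mean gradient of the batch
            grad_b = (2.0 / len(yb)) * error.sum()
            w -= lr * grad_w                            # update the weights
            b -= lr * grad_b
    return w, b


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0                                 # noise-free line

w, b = minibatch_sgd(X, y)
print(np.round(w, 3), round(b, 3))
```

Setting `batch_size` to 1 gives SGD, and setting it to the dataset size gives batch gradient descent.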
Types - Gradient Descent
Batch GD : θ=θ−η⋅∇θJ(θ)
SGD : θ=θ−η⋅∇θJ(θ;x(i);y(i))
In gradient descent, one tries to reach the minimum of the loss function with
respect to the parameters, using the derivatives calculated in back-propagation.
The easiest way is to adjust each parameter by subtracting its corresponding
derivative multiplied by a learning rate, which regulates how far you move
in the gradient direction.
The three main flavors of gradient descent are batch, stochastic, and mini-batch.
Backpropagation is not a learning method, but rather a computational trick that is often
used in learning methods.
It is a simple application of the chain rule of derivatives, which
gives you the ability to compute all required partial derivatives in linear time.
Networks are trained with SGD, using backprop as the gradient-computing technique.
Back Propagation
The goal of back Propagation is to optimize the weights so that the neural network can learn how to correctly map
arbitrary inputs to outputs.
Total Error
Backward Pass
Consider a single weight. We want to know how much a change in that weight affects
the total error (the gradient of the total error with respect to that weight).
Next, how much does the output of a neuron change with respect to its total net input?
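The two questions above are the factors of the chain rule. The sketch below works them out for a single sigmoid neuron and checks the result against a numerical estimate. The error function E = ½(target − out)², the single-neuron setup, and all numeric values are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    net = w * x + b          # total net input of the neuron
    return sigmoid(net)      # output of the neuron


def total_error(out, target):
    return 0.5 * (target - out) ** 2


def gradient_wrt_weight(w, b, x, target):
    out = forward(w, b, x)
    dE_dout = out - target            # how the error changes with the output
    dout_dnet = out * (1.0 - out)     # how the output changes with the net input
    dnet_dw = x                       # how the net input changes with the weight
    return dE_dout * dout_dnet * dnet_dw   # chain rule


w, b, x, target = 0.4, 0.6, 0.5, 0.01      # illustrative values
analytic = gradient_wrt_weight(w, b, x, target)

# Central-difference check of the same derivative.
eps = 1e-6
numeric = (total_error(forward(w + eps, b, x), target)
           - total_error(forward(w - eps, b, x), target)) / (2 * eps)

print(analytic, numeric)
```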
What is a gradient?
Vanishing Gradient Problem
• As more layers using certain activation functions are added to a neural network,
the gradients of the loss function approach zero, making the network hard to
train.
• Certain activation functions, like the sigmoid function, squash a large input
space into a small output space between 0 and 1.
• Therefore, a large change in the input of the sigmoid function causes only a small
change in the output. Hence, the derivative becomes small.
• When the inputs of the sigmoid function become very large or very small (when |x|
becomes bigger), the derivative becomes close to zero.
• In networks with few layers and sigmoid activation functions, the
vanishing gradient is not a significant problem.
• When more layers are used, it can cause the gradient to be too small
for training to work effectively.
• Gradients of neural networks are found using backpropagation.
• Backpropagation finds the derivatives of the network by moving layer
by layer from the final layer to the initial one.
• By the chain rule, the derivatives of each layer are multiplied down
the network (from the final layer to the initial one) to compute the
derivatives of the initial layers.
• However, when n hidden layers use an activation like the sigmoid
function, n small derivatives are multiplied together.
• Thus, the gradient decreases exponentially as we propagate down to
the initial layers.
• A small gradient means that the weights and biases of the initial layers
will not be updated effectively with each training step.
• Since these initial layers are often crucial to recognizing the core
elements of the input data, this can lead to overall inaccuracy of the
whole network.
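The exponential shrinkage can be illustrated with the sigmoid derivative alone. This sketch deliberately ignores the weight terms that also appear in the chain rule and evaluates every layer at the same input, which is a simplifying assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)      # never larger than 0.25 (reached at z = 0)


def gradient_factor(n_layers, z=0.0):
    # Product of n sigmoid derivatives, one per hidden layer.
    factor = 1.0
    for _ in range(n_layers):
        factor *= sigmoid_derivative(z)
    return factor


for n in (1, 2, 5, 10):
    print(n, gradient_factor(n))
```

Even in the best case (z = 0), ten layers scale the gradient by 0.25 to the power 10, which is below one millionth.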
Ways to detect whether your deep network is suffering from the
vanishing gradient problem:
• The model will improve very slowly during the training phase, and it is also
possible that training stops very early, meaning that any further training
does not improve the model.
• The weights closer to the output layer of the model would witness more of
a change, whereas the layers that occur closer to the input layer would not
change much (if at all).
• Model weights shrink exponentially and become very small when training
the model.
• Vanishing gradients usually happen while using the Sigmoid or Tanh activation
functions in the hidden layer units.
• Looking at the function plots, we can see that when inputs become very
small or very large, the sigmoid function saturates at 0 and 1 and the tanh
function saturates at -1 and 1.
• In both these cases, their derivatives are extremely close to 0.
• These ranges of the function are called “saturating regions” or “bad regions”.
• Thus, if your input lies in any of the saturating regions, then it has almost no
gradient to propagate back through the network.
• Batch normalization can be simply visualized as an additional layer in
the network that normalizes the data (using a mean and standard
deviation) before feeding it into the hidden unit activation function.
• Batch normalization normalizes the input and ensures that |x| lies within
the “good range” (marked as the green region) and doesn’t reach the
outer edges of the sigmoid function.
• If the input is in the good range, then the activation does not saturate,
and thus the derivative also stays in the good range, i.e., the derivative
value isn’t too small.
• Thus, batch normalization prevents the gradients from becoming too
small and makes sure that the gradient signal is heard.
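A simplified sketch of this effect: a batch of pre-activations that sits deep in the sigmoid's saturating region is normalized per feature, and the average sigmoid derivative is compared before and after. The learnable scale and shift parameters of full batch normalization are omitted, and the data, `eps` value, and function names are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def batch_norm(z, eps=1e-5):
    # Normalize each feature with the mean and variance of the current batch.
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
z = rng.normal(loc=10.0, scale=2.0, size=(64, 3))   # inputs far into saturation
z_norm = batch_norm(z)

grad_before = sigmoid_derivative(z).mean()
grad_after = sigmoid_derivative(z_norm).mean()
print(grad_before, grad_after)
```

After normalization the inputs are centered near zero, where the sigmoid derivative is largest.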
Exploding Gradient Problem
Exploding gradients are a problem where large error gradients accumulate and
result in very large updates to the neural network's weights during training.
This results in the model being unstable and unable to learn from the training data.
Ways to detect whether your deep network is suffering from the
exploding gradient problem:
• Model weights grow exponentially and become very large when training the
model.
• The model weights become NaN in the training phase.
Approaches to address both vanishing and exploding gradient
problems
1. Reducing the number of layers
This solution can be used in both scenarios (exploding and vanishing
gradients). However, by reducing the number of layers in our network, we give up
some of our model's complexity, since having more layers makes the network
more capable of representing complex mappings.
3. Weight Initialization
A more careful choice of the random initialization for your network
tends to be a partial solution, since it does not solve the problem completely.
Training a NN in Keras
Data Set : Pima Indians Diabetes Data Set
It describes patient medical record data for Pima Indians and whether
they had an onset of diabetes within five years.
It is a binary classification problem (onset of diabetes as 1 or not as 0).
The input variables that describe each patient are numerical and have
varying scales.
The eight attributes of the dataset, followed by the class label, are listed below:
1. Number of times pregnant.
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skin fold thickness (mm).
5. 2-hour serum insulin (mu U/ml).
6. Body mass index.
7. Diabetes pedigree function.
8. Age (years).
9. Class, onset of diabetes within five years.
Sample records:
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
Neural Network Structure
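import pandas as pd
from google.colab import files

# Upload the CSV file into the Colab session
uploaded = files.upload()
df = pd.read_csv("/content/pima-indians-diabetes.csv")
# Output
# (The model definition, compile and fit steps are not shown here;
# model, X and y are assumed to come from those steps.)
# Evaluate the Keras model
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))
# Output
# Inspect the model configuration
model.get_config()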
Calculating the No. of Trainable Parameters
3×4+4×2+1×4+1×2
=3×4+4×2+4+2
=i×h+h×o+h+o
Example 2:
A feed-forward neural network with three hidden layers. Number of units in the
input, first hidden, second hidden, third hidden and output layers are
respectively 3, 5, 6, 4 and 2. Calculate the no. of trainable parameters.
Ans :
• Number of connections between the first and second layer: 3 × 5 = 15, which is
nothing but the product of i and h1.
• Number of connections between the second and third layer: 5 × 6 = 30, which is
nothing but the product of h1 and h2.
• Number of connections between the third and fourth layer: 6 × 4 = 24, which is
nothing but the product of h2 and h3.
• Number of connections between the fourth and fifth layer: 4 × 2= 8, which is
nothing but the product of h3 and o.
• Number of connections between the bias of the first layer and the neurons of
the second layer (except bias of the second layer): 1 × 5 = 5, which is nothing
but h1.
• Number of connections between the bias of the second layer and the neurons
of the third layer: 1 × 6 = 6, which is nothing but h2.
• Number of connections between the bias of the third layer and the neurons of
the fourth layer: 1 × 4 = 4, which is nothing but h3.
• Number of connections between the bias of the fourth layer and the neurons of
the fifth layer: 1 × 2 = 2, which is nothing but o.
• Summing up all:
3×5+5×6+6×4+4×2+1×5+1×6+1×4+1×2
= 15 + 30 + 24 + 8 + 5 + 6 + 4 + 2
= 94
Thus, this feed-forward neural network has 94 connections in all and thus 94 trainable
parameters.
• Thus, the total number of parameters in a feed-forward neural network with three
hidden layers is given by:
(i × h1 + h1 × h2 + h2 × h3 + h3 × o) + h1 + h2 + h3+ o
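The same count can be written as a short helper that works for any number of fully connected layers. The function name `count_trainable_parameters` is hypothetical; the two calls reproduce the 3-4-2 and 3-5-6-4-2 calculations.

```python
def count_trainable_parameters(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network."""
    # One weight per connection between consecutive layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias per neuron in every layer except the input layer.
    biases = sum(layer_sizes[1:])
    return weights + biases


print(count_trainable_parameters([3, 4, 2]))         # 3×4 + 4×2 + 4 + 2 = 26
print(count_trainable_parameters([3, 5, 6, 4, 2]))   # 94
```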
Calculate the number of trainable parameters for this model :
• Because units are dropped during training, the weights of the network will be
larger than normal.
• Hence, before the network is used for prediction, the weights are scaled down
using the chosen dropout rate.
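A rough sketch of this scaling, assuming the classic formulation in which units are dropped with probability `rate` during training and the weights are multiplied by the retention probability (1 − rate) at prediction time. The function names, the all-ones activations, and the weight values are hypothetical.

```python
import numpy as np


def dropout_train(activations, rate, rng):
    # Keep each unit with probability (1 - rate); dropped units output zero.
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask


def scale_weights_for_inference(weights, rate):
    # At prediction time all units are active, so scale the weights down.
    return weights * (1.0 - rate)


rng = np.random.default_rng(0)
rate = 0.5
activations = np.ones((1000, 4))     # 1000 examples, 4 hidden units
weights = np.full((4, 1), 2.0)       # weights into the next layer

# Average input to the next layer during training (with dropout)...
train_input = (dropout_train(activations, rate, rng) @ weights).mean()
# ...and at prediction time (no dropout, scaled weights).
test_input = (activations @ scale_weights_for_inference(weights, rate)).mean()

print(train_input, test_input)
```

The two averages come out close to each other, which is the purpose of the scaling.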
Batch Size
Total number of training examples present in a single batch.
Designing a DNN
DNNs are NNs that are designed to mimic human intelligence.
Points to consider while designing :