Module1 - Upto Loss Function

Uploaded by

Saud Ahmed

CSE235-Introduction

to Deep Learning
Module 1
• Fundamentals of Deep Learning
• Perceptron
• Multilayer Perceptron
• Activation Functions
• Loss Functions
• Optimization techniques - Gradient Descent
• Feedforward Neural Network
• Training Neural Networks - Backpropagation
• Hyperparameters
• Underfitting
• Overfitting
• Regularization
• Dropouts
• Batch Normalization
Fundamentals of Deep Learning
Deep Learning (DL)
• Deep learning is a way of classifying, clustering, and
predicting things by using a neural network that has been
trained on vast amounts of data.
Applications of DL
Deep Learning (DL)
• DL has its roots in neural networks (NN)
• NN are a set of complex algorithms that are
designed for pattern recognition.
• These NNs are modeled after the human brain and its
biological neurons.
• A human brain has roughly 86 billion neurons, each
connected to many other neurons.
• The fundamental unit of a NN is a node, based on
the biological neuron of the human brain.
Deep Learning (DL)
Deep NN
• These are NNs with more than two layers.
• 'Deep' refers to the number of hidden layers.
Some DL Architectures
Designing a NN
• Movement of information in a NN happens in two stages:
(feed)forward propagation and backpropagation.
Perceptron
A single-layer perceptron is the basic unit of a neural network. A perceptron consists
of input values, weights and a bias, a weighted sum, and an activation function.

• A perceptron works by taking in some numerical inputs along with what are known
as weights and a bias.
• It then multiplies each input by its respective weight.
• These products are then added together (this is known as the weighted sum),
along with the bias.
• The activation function takes the weighted sum plus the bias as its input and
returns a final output.
Assume we have a single neuron and three inputs x1, x2, x3 multiplied by the
weights w1, w2, w3 respectively.
Given the numerical values of the inputs and the weights, there is a function, inside
the neuron, that will produce an output.

What if we wanted the outputs to fall into a certain range, say 0 to 1?

An activation function is a function that converts the input given (the input, in this
case, would be the weighted sum) into a certain output based on a set of rules.
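The perceptron steps above can be sketched in a few lines of Python. This is only a minimal illustration: the input, weight, and bias values are made up, and the sigmoid is assumed as the activation because it keeps the output between 0 and 1.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(weighted_sum + bias)

# Illustrative values only: three inputs x1..x3 and weights w1..w3.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2
output = perceptron(x, w, b)
print(output)
```

Swapping `sigmoid` for another activation changes only the last step; the weighted sum and bias stay the same.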
Designing a NN
Multi-Layer Perceptron
MLP : Multi Layer Perceptron
Build a network with 2 input neurons, 3 hidden neurons, 2 output neurons, and 4 observations in training
set.

Use same number of layers and neurons but reduce the number of observations in dataset to 1 instance:
Activation Functions
What is an Activation Function?

• They basically decide whether a neuron should be activated or not.
• That is, whether the information/input that the neuron is receiving is relevant for
the given prediction or should be ignored.
• Input to the activation function is the weighted sum of the neuron's inputs plus the bias.
• The activation function is the non-linear transformation that we apply to
the input signals of hidden neurons.
• This transformed output is then sent to the next layer of neurons as
input.
• A neural network without an activation function is essentially just a
linear regression model.
• The activation function applies a non-linear transformation to the input,
making the network capable of learning and performing more complex tasks.
• This is applied to the hidden neurons.

Need for Activation Function

Purpose of Activation Functions is to introduce non-linearities in the network

Types of Activation functions with Neural Networks

The Activation Functions can be basically divided into different types-

1. Binary Step functions
2. Linear Activation Function
3. Non-linear Activation Functions
1. Binary Step Function
• A binary step function is a threshold-based activation function.
• It uses a threshold to decide whether a neuron should be activated or
not
• If the input to the activation function (Y) is above (or below) a certain
threshold, the neuron is activated and sends exactly the same signal to
the next layer.
• Otherwise, the neuron is not activated. I.e., signal is not passed to the
next layer.

Activation function: f(Y) = “activated” if Y > threshold, else “not activated”
Alternatively, f(Y) = 1 if Y > threshold, 0 otherwise
Disadvantages of Binary Step Functions :

1. They don't provide multi-value outputs, so they are not
suitable for multi-class classification.
2. The gradient of the step function is zero wherever it is defined,
so backpropagation receives no signal with which to update the weights.
2. Linear Activation Function
• Also known as identity function.
• In the Linear Activation Function, the dependent variable has a direct,
proportional relationship with the independent variable.
• The output is proportional to the input.
Equation : f(x) = x
Range : (-infinity to infinity)

• It doesn’t help with the complexity or various parameters of the usual data
that is fed to neural networks.
• The output of the function is not confined to any range.
Disadvantages of Linear Activation Function

• The gradient of the function is a constant and doesn't involve the input (x).
• Hence it is difficult during backpropagation to identify the neurons
whose weights have to be adjusted.
• The neuron passes the signal as it is to the next layer.
• However many layers are stacked, the last layer will be a linear function
of the first layer.
• This linear activation function is generally used by
the neurons in the input layer of a NN.
Non-linear Activation Function

The Nonlinear Activation Functions are the most used activation functions.

They make it easy for the model to generalize or adapt to a variety of data
and to differentiate between the outputs.

The Nonlinear Activation Functions are mainly divided on the basis of their range
or curves
Advantages of Non-Linear Activation Functions

• The gradient of the function involves the input 'x'.
• Hence it is easy to understand which weights of the input neurons have to be
adjusted during backpropagation to give a better prediction.
1. Sigmoid or Logistic Activation Function
Input : a real number
Output : a number between 0 and 1

The main reason we use the sigmoid function is that its output lies between 0 and 1.
Therefore, it is especially used for models where we have to predict a probability as
the output. Since a probability only exists in the range 0 to 1, sigmoid is
the right choice.
• The smaller the input number (more negative), the closer the output is to 0.
• The greater the input number (more positive), the closer the output is to 1.
• Adds non-linearity.
Disadvantages of Sigmoid Activation Function

• The gradient of the function has a significant value only for inputs between –3 and 3.
• For inputs outside this range, the gradient is small, and
eventually it approaches zero.
• The network then stops learning and suffers from the
vanishing gradient problem.
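To see how quickly the sigmoid gradient shrinks outside the range mentioned above, the following minimal sketch prints the function and its derivative for a few inputs. It relies on the standard identity that the sigmoid derivative equals s(x) · (1 − s(x)); the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0, 3, 6, 10):
    print(f"x = {x:>2}  sigmoid = {sigmoid(x):.5f}  gradient = {sigmoid_grad(x):.5f}")
```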
2. Tanh or hyperbolic tangent Activation Function

• The output range of the tanh function is (-1 to 1). tanh is also sigmoidal (S-shaped).
• Tanh is zero centered.
• Negative inputs are mapped strongly negative.
• Positive inputs are mapped strongly positive.
• Zero inputs are mapped near zero.
• Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
Fig: tanh v/s Logistic Sigmoid
Disadvantages of Tanh Activation Function

• The gradient is very steep near zero, but eventually becomes zero for large positive or negative inputs.
• The network stops learning and suffers from the
vanishing gradient problem.
• But tanh is zero centered, so its outputs and gradients can take both positive and negative values.
• Hence the tanh non-linearity is preferred over sigmoid.
Comparison of Sigmoid and Tanh Activation Functions...
• For integers between –6 and +6
• The output of tanh is centered around zero, meaning that the mean of the data it passes on is close to zero.
• Training of the neural network converges faster if the inputs to the neurons in
each layer have a mean of zero and a variance of 1, and are decorrelated.
• Since the input to each layer comes from the previous layer, it is important
that the outputs of the previous layers (inputs to the next layers) are centered
around zero.
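The comparison for integers between –6 and +6 can be reproduced with a short sketch. It tabulates both functions and then averages their outputs, which illustrates the zero-centering point: over this symmetric range the sigmoid outputs average about 0.5, while the tanh outputs average about 0.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One row per integer input: (x, sigmoid(x), tanh(x)).
rows = [(x, sigmoid(x), math.tanh(x)) for x in range(-6, 7)]
for x, s, t in rows:
    print(f"{x:>3}  sigmoid = {s:.4f}  tanh = {t:+.4f}")

sigmoid_mean = sum(s for _, s, _ in rows) / len(rows)
tanh_mean = sum(t for _, _, t in rows) / len(rows)
print(f"mean sigmoid output = {sigmoid_mean:.4f}")
print(f"mean tanh output    = {abs(tanh_mean):.4f} (in absolute value)")
```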
3. ReLU (Rectified Linear Unit) Activation Function

• The ReLU is the most used activation function, since it is used in almost all
convolutional neural networks and deep learning models.
• The ReLU is half rectified (from the bottom). R(z) is zero when z is less than
zero and R(z) is equal to z when z is above or equal to zero.
• Range: [0 to infinity)
• Any negative input given to the ReLU activation function is turned into
zero immediately, so negative values are not mapped appropriately.
Disadvantages of ReLU :

• For negative inputs, the gradient is zero.

• Hence during backpropagation, the weights and bias of some neurons are not
updated.
• This creates dead neurons, which never get activated
• This is known as "Dying ReLU problem"
4. Leaky ReLU/Parametric ReLU
• It is an attempt to solve the dying ReLU problem.

Fig : ReLU v/s Leaky ReLU

• The function has a small non-zero slope α for negative inputs.
• The leak helps to increase the range of the ReLU function.
• Usually, the value of α is 0.01 (Leaky ReLU).
• When α is not fixed at 0.01 but learned during training, it is called Parametric ReLU;
when α is drawn at random, it is called Randomized ReLU.
f(x) = max(αx, x)
• Therefore the range of the Leaky ReLU is (-infinity to infinity).
Advantages and Disadvantages of Leaky ReLU :

• For negative inputs, the gradient is a non-zero value

• Hence during backpropagation, the weights and bias of all neurons
are updated. No dead neurons
• The predictions made for negative inputs are not consistent.
• Since the gradient is a very small value for negative inputs, learning of model
parameters is time consuming
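A minimal sketch of ReLU and Leaky ReLU with their gradients makes the difference for negative inputs concrete. The slope value 0.01 is an assumption chosen for illustration, and the gradient at exactly zero is taken to be the negative-side value.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Gradient is 0 for negative inputs, which is what lets a neuron "die".
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    # alpha is the small slope applied to negative inputs (0.01 assumed here).
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # The gradient never becomes exactly zero, so weights keep being updated.
    return 1.0 if x > 0 else alpha

for x in (-2.0, 3.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```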
• Sigmoid functions and their combinations generally work better in the
case of classifiers.
• Sigmoid and tanh functions are sometimes avoided due to the vanishing
gradient problem.
• The ReLU function is a general activation function and is used in most cases
these days. ReLU is less computationally expensive than tanh and sigmoid
because it involves simpler mathematical operations and activates only a few
neurons.
• If we encounter a case of dead neurons in our networks, the leaky ReLU
function is the best choice.
• Always keep in mind that the ReLU function should only be used in the
hidden layers. At present, ReLU works most of the time as a general
approximator.
• Variants of ReLU
• Leaky ReLU
• Parametric ReLU
• Exponential Linear Unit
SoftMax Activation Function

Softmax is an activation function that scales numbers/logits into probabilities.

The output of a Softmax is a vector (say v ) with probabilities of each possible
outcome. The probabilities in vector v sum to one for all possible outcomes or
classes.

Used at the end of network in Multi class classification
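As a rough sketch of how softmax scales logits into probabilities, the function below exponentiates each score and divides by the total. Subtracting the maximum logit first is a common numerical-stability step added here; the three example logits are arbitrary.

```python
import math

def softmax(logits):
    """Turn a list of raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

v = softmax([2.0, 1.0, 0.1])
print(v, sum(v))
```

The largest logit receives the largest probability, and the ordering of the classes is preserved.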

Activation Function
Activation Functions
Gradients and Activation Functions

• When constructing Artificial Neural Network (ANN) models, one of the key
considerations is to select activation functions for the hidden and output
layers that are differentiable, i.e. their derivatives exist and are not zero everywhere.
• The gradient/derivative of the activation function is required during
backpropagation
• To update the weights of the neurons
• To determine how much and in what direction (+/-) the weights have to
be adjusted
Complete This !!!!!
# Activation Function Properties Pros Cons

1 Sigmoid
2 Softmax
3 ReLu
4 Leaky ReLu
6 TanH

Tip 1:
Generally, we use softmax activation instead of sigmoid with the cross-entropy loss because softmax
activation distributes the probability throughout each output node.

Which to use when and Where ?????

LOSS FUNCTIONS

From Word Doc

Loss/Cost/Objective/Error Functions
# | Loss Function | Type of Loss Function | Properties | Pros | Cons
1 | MSE/Quadratic Loss/L2 Loss | Regression | | |
2 | Mean Absolute Error/L1 Loss | | | |
3 | Mean Bias Error | | | |
4 | Hinge Loss/Multi class SVM Loss | | | |
5 | Cross Entropy Loss/Negative Log Likelihood | Classification | | |
6 | Huber | | | |

Which to use when and With what ?????

Cross Entropy Loss

P : Actual Probability
Q : Predicted Probability

Entropy :
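The slide's own formula is not preserved here, so the following is a sketch based on the standard definitions, stated as an assumption: entropy H(P) = −Σ P(x) log P(x) and cross-entropy H(P, Q) = −Σ P(x) log Q(x), with P the actual and Q the predicted distribution. The example distributions are made up.

```python
import math

def entropy(p):
    """H(P) = -sum P(x) * log P(x); terms with P(x) = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) * log Q(x), with P the actual and Q the predicted distribution."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0, 0.0]          # actual class as a one-hot distribution (illustrative)
q_good = [0.9, 0.05, 0.05]   # confident, correct prediction
q_bad = [0.1, 0.6, 0.3]      # mostly wrong prediction
print(cross_entropy(p, q_good), cross_entropy(p, q_bad))
```

The loss is small when the predicted probability of the true class is high and grows as that probability falls.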
Loss Functions
BACK - PROPAGATION

10/17/2022
74
C = Loss = Mean Squared Error()

Optimization
Given a function f(x), an optimization algorithm helps in either minimizing or
maximizing the value of f(x).

In deep learning, optimization algorithms are used to train the neural network by
optimizing the cost function J. The cost function is defined as:
$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L\big(y'^{(i)}, y^{(i)}\big)$, where m is the number of training examples.

• The value of cost function J is the mean of the loss L between the predicted value
y’ and actual value y.
• The value y’ is obtained during the forward propagation step and makes use of the
Weights W and biases b of the network.
• With the help of optimization algorithms, we minimize the value of Cost Function J
by updating the values of the trainable parameters W and b.
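The description of J as the mean of the per-example loss can be sketched as below. Squared error is assumed as the loss L purely for illustration, and the predicted and actual values are invented.

```python
def squared_error(y_pred, y_true):
    """Loss L for a single example (squared error chosen for illustration)."""
    return (y_pred - y_true) ** 2

def cost(y_preds, y_trues, loss=squared_error):
    """Cost J: the mean of the per-example loss over all examples."""
    return sum(loss(p, t) for p, t in zip(y_preds, y_trues)) / len(y_trues)

y_true = [1.0, 0.0, 1.0, 1.0]   # actual values y
y_pred = [0.5, 0.5, 1.0, 0.0]   # predicted values y'
J = cost(y_pred, y_true)
print(J)  # 0.375
```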
Gradient Descent
Batch Gradient Descent

• Batch Gradient Descent involves calculations
over the full training set at each step, as a result
of which it is very slow on very large training
sets.
• Thus, Batch GD becomes very computationally expensive.
• In Stochastic Gradient Descent (SGD), we consider just one example at a
time to take a single step. We do the following steps in one epoch for SGD:
1. Take an example
2. Feed it to the neural network
3. Calculate its gradient
4. Use the gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the examples in the training dataset
• Drawback:
• SGD takes a larger number of iterations than GD to reach the minimum, and
its updates are noisier than those of Gradient Descent.
• As SGD computes the derivative for only 1 point at a time, the time taken to
complete one epoch is large compared to the Gradient Descent algorithm.
Mini Batch Stochastic Gradient Descent
• MB-SGD is an extension of the SGD algorithm.
• It is also common to sample a small number of data points instead of just
one point at each step; this is called “mini-batch” gradient descent.
Mini-batch tries to strike a balance between the stable gradient estimates of
gradient descent and the speed of SGD.
• It overcomes the time-consuming per-example updates of SGD by taking a batch /
subset of points from the dataset to compute the derivative.
• After creating the mini-batches of fixed size, we do the following steps in one
epoch:
1. Pick a mini-batch
2. Feed it to the neural network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the mini-batches we created
• Drawback: the weight updates are noisier than in batch gradient descent, because
the derivative computed on a mini-batch does not always point towards the minimum.
Types - Gradient Descent

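Batch GD : $\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$

SGD : $\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})$

Mini Batch : $\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$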
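The three update rules differ only in how many examples feed each gradient. The sketch below shows this with a single `train` function whose `batch_size` argument selects the flavor. The one-parameter model y = w · x, the toy data, the learning rate, and the function names are all assumptions made for illustration.

```python
import random

def gradient(w, xs, ys):
    """Mean gradient of the squared error (w*x - y)^2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, batch_size, lr=0.01, epochs=200, seed=0):
    """Gradient descent on a one-parameter model y = w * x.

    batch_size == len(xs) -> batch GD, 1 -> SGD, anything between -> mini-batch.
    """
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                      # visit the examples in a new order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            g = gradient(w, [xs[i] for i in batch], [ys[i] for i in batch])
            w = w - lr * g                    # theta = theta - eta * grad J(theta)
    return w

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]                    # toy data generated with w = 3
for name, bs in (("batch", len(xs)), ("sgd", 1), ("mini-batch", 4)):
    print(name, round(train(xs, ys, bs), 4))
```

On this noiseless toy problem all three settings recover a weight close to 3. With real data the smaller batch sizes would show noisier updates.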

Batch Vs Stochastic Vs Mini Batch
Optimization
Gradient descent is an optimization algorithm often used for finding the weights.

SGD is one of many optimization methods. It is a first-order optimizer,
meaning that it is based on analysis of the gradient of the objective.

In gradient descent one is trying to reach the minimum of the loss function with
respect to the parameters using the derivatives calculated in back-propagation.

The easiest way is to adjust each parameter by subtracting its corresponding
derivative multiplied by a learning rate, which regulates how far you move
along the gradient direction.

The three main flavors of gradient descent are batch, stochastic, and mini-batch.

Backpropagation is an efficient method of computing gradients in directed graphs of
computations, such as neural networks.

It is not a learning method, but rather a computational trick which is often
used in learning methods.
It is actually a simple implementation of the chain rule of derivatives, which
gives you the ability to compute all required partial derivatives in linear time.
Neural networks are typically trained with SGD, using backprop as the gradient-computing technique.
Back Propagation
Back Propagation
The goal of back Propagation is to optimize the weights so that the neural network can learn how to correctly map
arbitrary inputs to outputs.

The Forward Pass

Total Error
Back Propagation
Backward Pass
Consider one weight of the network. We want to know how much a change in that weight affects
the total error (the gradient of the total error with respect to that weight).

Applying Chain Rule

Back Propagation

Next, how much does the output of a neuron change with respect to its total net input?
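The chain-rule steps of the backward pass can be sketched for the simplest possible case: one weight feeding one sigmoid output neuron. All values and names are hypothetical, and the error is assumed to be 0.5 · (target − out)². The result is checked against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(w, x, b, target):
    out = sigmoid(w * x + b)
    return 0.5 * (target - out) ** 2

# Hypothetical values for one weight w feeding one sigmoid output neuron.
w, x, b, target = 0.4, 0.6, 0.1, 1.0

net = w * x + b                 # total net input
out = sigmoid(net)              # output of the neuron

dE_dout = out - target          # how the error changes with the output
dout_dnet = out * (1.0 - out)   # how the output changes with its net input
dnet_dw = x                     # how the net input changes with the weight

dE_dw = dE_dout * dout_dnet * dnet_dw   # chain rule

# Sanity check against a central finite-difference estimate.
eps = 1e-6
numeric = (error(w + eps, x, b, target) - error(w - eps, x, b, target)) / (2 * eps)
print(dE_dw, numeric)
```

The gradient is negative here, so a gradient-descent step would increase the weight and push the output towards the target.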
What is a gradient ?

• A gradient is a measure of how much the output variable changes for
a small change in the input.
• This gradient is then used to update/learn the model parameters:
weights and biases.
• The parameter update rule is
$W_x^{new} = W_x^{old} - \eta \cdot \frac{\partial Loss}{\partial W_x}$
• If the derivative term in the above equation is too small, there will be
a very small change in Wx.
• Hence the new and old weights are almost the same. No learning.
• The weights of the initial layers would continue to remain
unchanged (or only change by a negligible amount), no matter how
many epochs you run with the backpropagation algorithm.
Problem of Vanishing Gradient
VANISHING GRADIENT PROBLEM

• As more layers using certain activation functions are added to neural networks,
the gradients of the loss function approach zero, making the network hard to
train.
• Certain activation functions, like the sigmoid function, squish a large input
space into a small output space between 0 and 1.
• Therefore, a large change in the input of the sigmoid function will cause only a small
change in the output. Hence, the derivative becomes small.
• When the inputs of the sigmoid function become very large or very small (when |x|
becomes bigger), the derivative becomes close to zero.
Vanishing Gradient Problem
• In networks with few layers and sigmoid activation functions, vanishing
gradients are not a significant problem.
• When more layers are used, the gradient can become too small
for training to work effectively.
• Gradients of neural networks are found using backpropagation.
• Backpropagation finds the derivatives of the network by moving layer
by layer from the final layer to the initial one.
• By the chain rule, the derivatives of each layer are multiplied down
the network (from the final layer to the initial) to compute the
derivatives of the initial layers.
• However, when n hidden layers use an activation like the sigmoid
function, n small derivatives are multiplied together.
• Thus, the gradient decreases exponentially as we propagate down to
the initial layers.
• A small gradient means that the weights and biases of the initial layers
will not be updated effectively with each training iteration.
• Since these initial layers are often crucial to recognizing the core
elements of the input data, this can lead to overall inaccuracy of the
whole network.
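The exponential decay described above can be illustrated with a back-of-the-envelope sketch. It uses only the fact that the sigmoid derivative never exceeds 0.25, and it ignores the weight terms that also appear in the chain rule, so the numbers bound the activation-derivative factor rather than the full gradient.

```python
max_sigmoid_grad = 0.25   # largest possible value of the sigmoid derivative (at x = 0)

for n_layers in (2, 5, 10, 20):
    # Upper bound on the product of n sigmoid derivatives multiplied by the chain rule.
    bound = max_sigmoid_grad ** n_layers
    print(f"{n_layers:>2} layers: activation-derivative product <= {bound:.3e}")
```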
Ways to detect whether your deep network is suffering from the
vanishing gradient problem:

• The model will improve very slowly during the training phase and it is also
possible that training stops very early, meaning that any further training
does not improve the model.

• The weights closer to the output layer of the model would witness more of
a change whereas the layers that occur closer to the input layer would not
change much (if at all).

• Model weights shrink exponentially and become very small when training
the model.

• The model weights become 0 in the training phase.

Vanishing Gradient Problem
Few Solutions:
1. Use other activation functions, such as ReLU,
which doesn’t cause a small derivative (its derivative is 1 for positive inputs).
2. Residual networks (ResNet)
• Use bypass/skip connections to let information bypass a few layers.
• Using these connections, information can be
transferred from layer n to layer n+t.
• To do this, the activation (output) of layer n
is connected to the activation of layer n+t.
• This lets the gradient pass between the
layers without any modification in size.
• A residual connection directly adds the value at the
beginning of the block, x, to the end of the block
(F(x)+x).
• This residual connection doesn’t go through
activation functions that “squash” the
derivatives, resulting in a higher overall derivative
of the block.
3. Batch Normalization :

• Vanishing gradients usually happen while using the Sigmoid or Tanh activation
functions in the hidden layer units.
• Looking at the plots of these functions, we can see that when inputs become very
small or very large, the sigmoid function saturates at 0 and 1 and the tanh
function saturates at -1 and 1.
• In both these cases, their derivatives are extremely close to 0.
• These ranges/regions of the function are called “saturating regions” or “bad regions”.
• Thus, if your input lies in any of the saturating regions, then it has almost no
gradient to propagate back through the network.
• Batch normalization can be simply visualized as an additional layer in
the network that normalizes the data (using a mean and standard
deviation) before feeding it into the hidden unit activation function.
• Batch normalization normalizes the input and ensures that |x| lies within
the “good range” (marked as the green region) and doesn’t reach the
outer edges of the sigmoid function.
• If the input is in the good range, then the activation does not saturate,
and thus the derivative also stays in the good range, i.e. the derivative
value isn’t too small.
• Thus, batch normalization prevents the gradients from becoming too
small and makes sure that the gradient signal is heard.
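A minimal sketch of the normalization step shows how it moves inputs out of the saturating region. It covers only the mean and variance normalization; the learnable scale and shift of full batch normalization are left out, and the pre-activation values are invented to sit deep in the sigmoid's flat region.

```python
import math

def batch_normalize(values, eps=1e-5):
    """Normalize a batch to roughly zero mean and unit variance.

    The learnable scale and shift used by full batch normalization are omitted here.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

pre_activations = [8.0, 10.0, 12.0, 14.0]   # illustrative values in the saturating region
normalized = batch_normalize(pre_activations)

print([round(sigmoid_grad(v), 5) for v in pre_activations])  # gradients before normalizing
print([round(sigmoid_grad(v), 5) for v in normalized])       # gradients after normalizing
```

Before normalization every gradient is close to zero; afterwards the inputs fall near zero, where the sigmoid gradient is at its largest.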
Exploding Gradient Problem
Exploding gradients are a problem where large error gradients accumulate and
result in very large updates to neural network model weights during training.

This results in the model being unstable and unable to learn from the training data.
Ways to detect whether your deep network is suffering from the
exploding gradient problem:

• Model weights grow exponentially and become very large when training the
model.
• The model weights become NaN in the training phase.
Approaches to address both vanishing and exploding gradient
problems
1. Reducing the number of layers
This solution can be used in both scenarios (exploding and vanishing
gradients). However, by reducing the number of layers in our network, we give up
some of our model's complexity, since having more layers makes the network
more capable of representing complex mappings.

2. Gradient Clipping (Exploding Gradients)

Checking for and limiting the size of the gradients whilst our model trains is
another solution.
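One common way to limit the size of the gradients is clipping by norm, sketched below as an assumption about what the slide has in mind (clipping each value to a fixed range is another option). The function name and the example gradient are hypothetical.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm never exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [30.0, -40.0]               # illustrative exploding gradient, norm = 50
clipped = clip_by_norm(grads, max_norm=5.0)
print(clipped)
```

Clipping by norm keeps the direction of the update and only shortens its length.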

3. Weight Initialization
A more careful choice of the random initialization for your network
tends to be a partial solution, since it does not solve the problem completely.
Training a NN in Keras
Data Set : Pima Indians Diabetes Data Set

It describes patient medical record data for Pima Indians and whether
they had an onset of diabetes within five years.
It is a binary classification problem (onset of diabetes as 1 or not as 0).
The input variables that describe each patient are numerical and have
varying scales.
The eight input attributes and the class label are listed below:
1. Number of times pregnant.
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skin fold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index.
7. Diabetes pedigree function.
8. Age (years).
9. Class, onset of diabetes within five years.

Sample records:
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
Neural Network Structure
from google.colab import files
uploaded = files.upload()

# first neural network with keras tutorial

import pandas as pd
from keras.models import Sequential
from keras.layers import Dense

# load the dataset; if the CSV has no header row (as in the sample records),
# pass header=None so the first record is not consumed as column names
df = pd.read_csv("/content/pima-indians-diabetes.csv")

# split into input (X) and output (y) variables

X = df.iloc[:,0:8]
y = df.iloc[:,8]

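# define the keras model

model = Sequential()
#input_layer = Dense(12, input_dim = 8, activation = 'relu')
#model.add(input_layer)
# first hidden layer: 12 units, expects the 8 input attributes
model.add(Dense(12, input_dim=8, activation='relu'))
# second hidden layer: 8 units
model.add(Dense(8, activation='relu'))
# output layer: one sigmoid unit for the binary class, as binary_crossentropy expects
model.add(Dense(1, activation='sigmoid'))
# compile the keras model and specify the training parameters of the architecture
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])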

# fit the keras model on the dataset

model.fit(X, y, epochs=150, batch_size=16)

#Output
# evaluate the keras model
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))

#Output
model.get_config()
Calculating the No. of Trainable Parameters

Ex1: With one hidden layer

No. of input units i = 3, hidden units h = 4 and
output units o = 2

Hence, no. of trainable parameters :

Summing it all,

3×4+4×2+1×4+1×2
=3×4+4×2+4+2
=i×h+h×o+h+o

Thus, the total number of parameters in a feed-forward neural network with

one hidden layer is given by:
(i × h + h × o) + h + o

Example 2:
A feed-forward neural network with three hidden layers. Number of units in the
input, first hidden, second hidden, third hidden and output layers are
respectively 3, 5, 6, 4 and 2. Calculate the no. of trainable parameters.

Ans :
• Number of connections between the first and second layer: 3 × 5 = 15, which is
nothing but the product of i and h1.
• Number of connections between the second and third layer: 5 × 6 = 30, which is
nothing but the product of h1 and h2.
• Number of connections between the third and fourth layer: 6 × 4 = 24, which is
nothing but the product of h2 and h3.
• Number of connections between the fourth and fifth layer: 4 × 2= 8, which is
nothing but the product of h3 and o.
• Number of connections between the bias of the first layer and the neurons of
the second layer (except bias of the second layer): 1 × 5 = 5, which is nothing
but h1.
• Number of connections between the bias of the second layer and the neurons
of the third layer: 1 × 6 = 6, which is nothing but h2.
• Number of connections between the bias of the third layer and the neurons of
the fourth layer: 1 × 4 = 4, which is nothing but h3.
• Number of connections between the bias of the fourth layer and the neurons of
the fifth layer: 1 × 2 = 2, which is nothing but o.
• Summing up all:
3×5+5×6+6×4+4×2+1×5+1×6+1×4+1×2
= 15 + 30 + 24 + 8 + 5 + 6 + 4 + 2
= 94
Thus, this feed-forward neural network has 94 connections in all and thus 94 trainable
parameters.

• To generalize this equation and find a formula.

3×5+5×6+6×4+4×2+1×5+1×6+1×4+1×2
=3×5+5×6+6×4+4×2+5+6+4+2
= i × h1 + h1 × h2 + h2 × h3+ h3 × o + h1 + h2 + h3+ o

• Thus, the total number of parameters in a feed-forward neural network with three
hidden layers is given by:
(i × h1 + h1 × h2 + h2 × h3 + h3 × o) + h1 + h2 + h3+ o
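The formula generalizes to any number of layers: multiply each pair of adjacent layer sizes for the weights, then add one bias per non-input unit. The helper below is a sketch with a hypothetical name, checked against the two worked examples.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network.

    layer_sizes lists the units per layer, input first and output last.
    """
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])      # one bias per non-input unit
    return weights + biases

print(count_parameters([3, 4, 2]))        # Ex1: 26
print(count_parameters([3, 5, 6, 4, 2]))  # Example 2: 94
```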
Calculate the number of trainable parameters for this model :

Bias is initialised to Zero

Hyperparameters
• Hyperparameters are the variables which determine the
network structure (e.g. number of hidden units) and the
variables which determine how the network is trained (e.g.
learning rate).
• Hyperparameters are set before training (before optimizing
the weights and biases).
Hyper parameters
1. No. of hidden layers and units
2. DropOut

• Deep learning neural networks are likely to quickly overfit a training
dataset with few examples.
• A larger/deeper NN is also likely to overfit, and hence to generalize poorly.
• Dropout is a regularization method used to prevent model overfitting.
• It simulates a large number of different network architectures from a
single model by randomly dropping out a few neurons from each layer
during each training iteration.
• It is a very computationally cheap and remarkably effective regularization
method to reduce overfitting and improve generalization error in deep
neural networks of all kinds.
• It can be used with most types of layers, such as dense fully
connected layers, convolutional layers, and recurrent layers such as
the long short-term memory network layer.
• Dropout may be implemented on any or all hidden layers in the
network as well as the visible or input layer. It is not used on the
output layer.
• The term “dropout” refers to dropping out units (hidden and
visible) in a neural network.
• Dropout is not used after training when making a prediction with
the fit network.
• The dropout hyperparameter specifies the probability at which outputs
of the layer are dropped out (or, inversely, the probability at which they
are retained).
• A dropout value of 20%-50% of neurons is generally used.
• A common value is a probability of 0.5 for retaining the output of each
node in a hidden layer (dropout is 0.5) and a value close to 1.0, such as
0.8, for retaining inputs from the visible layer (dropout is 0.2).
• The weights of the network will be larger than normal because of
dropout.
• Hence, after training, the weights are scaled down using the chosen dropout rate.
• The network can then be used as normal to make predictions.
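The sketch below shows the mechanics of dropout on a list of activations. Note that it uses the "inverted dropout" variant, which is a swapped-in technique rather than the scheme described above: kept units are scaled up during training, so no weight rescaling is needed afterwards. The function name, the fixed seed, and the values are assumptions.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate` during training.

    Kept units are divided by (1 - rate) so the expected sum is unchanged,
    which means nothing has to be rescaled at prediction time.
    """
    if not training or rate == 0.0:
        return list(activations)    # dropout is switched off when predicting
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)              # fixed seed so the sketch is repeatable
hidden = [1.0] * 10
print(dropout(hidden, rate=0.5, training=True, rng=rng))
print(dropout(hidden, rate=0.5, training=False, rng=rng))
```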

3. Weight Initialization

• Different weight initialization schemes are used according to the activation function used
on each layer.
• For a NN with L layers, there are L-1 hidden layers, plus one input layer and one output
layer.
• The parameters (weights and biases) for layer l are represented as W[l] and b[l].
• These methods serve as good starting points for initialization and mitigate
the chances of exploding or vanishing gradients.
• They set the weights neither too much bigger than 1, nor too much less
than 1.
• So, the gradients do not vanish or explode too quickly. They help avoid
slow convergence

Source: Neural networks and deep learning, Andrew Ng (Coursera.org).
• REFER THE PDF FOR THE OTHER HYPERPARAMETERS
Epochs
One Epoch is when an ENTIRE dataset is passed forward and
backward through the neural network only ONCE.

Batch Size
Total number of training examples present in a single batch.
Designing a DNN
DNN are NNs that are designed to mimic human intelligence
Points to consider while designing :

1. which layer to use?

2. How many neurons to use in each layer?
3. How to arrange the layers?
4. Which Activation function to use?
5. Others


