DL Unit-3
We have learned about gradient descent and its various types. We looked into Stochastic gradient
descent, which is excellent when we have a huge dataset.
However, our dataset may contain both sparse and dense features.
Sparse features have very few non-zero values and require a
higher learning rate. Dense features chiefly have non-zero values,
requiring a lower learning rate. Stochastic gradient descent, however, applies the same learning rate to every feature. To tackle this problem, we use AdaGrad, which uses a different learning rate for each feature. Let us see how AdaGrad actually works.
Working of AdaGrad
AdaGrad uses the gradients of all the previous steps to calculate the learning rate of a particular feature at each step:

η_t = η / √(V_t + ε)

Here, η is the initial learning rate. ε is a small positive value; it is added in the denominator to avoid division by zero if V_t becomes zero. V_t is given by:

V_t = g_1² + g_2² + … + g_t²
The gradients of all the previous steps are used to calculate V_t. If a dense feature is updated frequently, the accumulated sum of squared gradients will be large, V_t will be high, and the learning rate will be lowered. A sparse feature, in contrast, is updated less often, so its learning rate stays higher than that of a dense feature. Each feature thus has its own learning rate at each iteration.
The equation below shows the update rule for a weight w at the (t+1)-th iteration, where η is the initial learning rate and g_t is the gradient at step t:

w_{t+1} = w_t − (η / √(V_t + ε)) · g_t
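A minimal NumPy sketch of this per-feature rule (the toy gradient pattern and hyperparameter values are illustrative assumptions, not from the original):

```python
import numpy as np

def adagrad_update(w, grad, v, lr=0.1, eps=1e-8):
    # V_t = V_{t-1} + g_t^2, accumulated separately for each feature
    v += grad ** 2
    # w_{t+1} = w_t - (lr / sqrt(V_t + eps)) * g_t
    w -= lr / np.sqrt(v + eps) * grad
    return w, v

# toy run: feature 0 gets dense gradients, feature 1 sparse ones
w, v = np.zeros(2), np.zeros(2)
for step in range(100):
    grad = np.array([1.0, 1.0 if step % 10 == 0 else 0.0])
    w, v = adagrad_update(w, grad, v)
print(v)  # V_t is much larger for the dense feature, so its steps shrink faster
```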
1.(b) Describe RMSProp
RMSProp, as described, has the effect of scaling the magnitudes of the gradient descent steps to be closer to each other in each dimension; it is more of a normalization approach.
By normalizing:
The higher-magnitude gradient (of B) becomes relatively smaller (better).
The lower-magnitude gradient (of W) becomes relatively larger (better).
RMSProp is an unpublished adaptive learning rate optimizer proposed by Geoff Hinton. The motivation is that
the magnitude of gradients can differ for different weights, and can change during learning, making it hard to
choose a single global learning rate. RMSProp tackles this by keeping a moving average of the squared gradient
and adjusting the weight updates by this magnitude. The gradient updates are performed as:
E[g²]_t = γ · E[g²]_{t−1} + (1 − γ) · g_t²

w_{t+1} = w_t − (η / √(E[g²]_t + ε)) · g_t

where γ is the decay rate of the moving average and ε is a small constant for numerical stability.
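A minimal NumPy sketch of these two equations (the hyperparameter values are common defaults, assumed here):

```python
import numpy as np

def rmsprop_update(w, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    # E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2
    # normalize the step in each dimension by the root mean square
    w -= lr / np.sqrt(avg_sq + eps) * grad
    return w, avg_sq
```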
2.(a) Explain Dropout
The term “dropout” refers to dropping out units (both hidden and visible) in a neural
network.
Simply put, dropout refers to ignoring units (i.e., neurons) during the training phase; the set of ignored neurons is chosen at random. By "ignoring", I mean these units are not considered during a particular forward or backward pass.
More technically, at each training stage, individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.
Training Phase:
For each hidden layer, for each training sample, for each iteration, ignore (zero out) a random fraction, 1-p, of nodes (and the corresponding activations).
Testing Phase:
Use all activations, but reduce them by a factor p (to account for the missing activations
during training).
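A minimal NumPy sketch of these two phases (standard, non-inverted dropout, as described above; p is the keep probability):

```python
import numpy as np

def dropout_forward(a, p=0.8, train=True):
    if train:
        # keep each unit with probability p; dropped units output zero
        mask = np.random.rand(*a.shape) < p
        return a * mask
    # testing: use all activations, scaled by p to account for training-time drops
    return a * p
```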
Experiment in Keras
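The original code is not included in these notes; a minimal sketch of such an experiment (the dataset, architecture, and the dropout fractions swept are illustrative assumptions) might look like:

```python
from tensorflow import keras
from tensorflow.keras import layers

# MNIST as a stand-in dataset (an assumption; the original data is not stated)
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

def build_model(rate):
    # small fully connected net; layer sizes are illustrative
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(rate),  # rate = fraction of units dropped
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# sweep dropout fractions and compare validation accuracy and loss
for rate in [0.0, 0.1, 0.2, 0.3]:
    history = build_model(rate).fit(x_train, y_train, epochs=5,
                                    validation_split=0.2, verbose=0)
    print(rate, max(history.history["val_accuracy"]))
```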
From the resulting graphs (not reproduced here) we can conclude that, as dropout increases, there is some increase in validation accuracy and a decrease in loss initially, before the trend starts to go down.
There could be two reasons for the trend going down beyond a dropout fraction of 0.2:
1. 0.2 is the actual minimum for this dataset, network, and the parameter settings used
Batch Normalization
Batch Normalization normalizes a layer's activations using statistics computed over each mini-batch. The normalized activations are then scaled and shifted using learnable parameters, allowing the model to adapt to the optimal activation distribution.
Batch Normalization is typically applied after the linear transformation of a layer (e.g.,
after the matrix multiplication in a fully connected layer or after the convolution operation
in a convolutional layer) and before the non-linear activation function (e.g., ReLU).
1. Mini-batch statistics: The mean and variance of the activations are calculated for each feature within the mini-batch.
2. Normalization: The activations are normalized by subtracting the mini-batch mean and dividing by the mini-batch
standard deviation.
3. Scaling and shifting: Learnable parameters (γ and β) are introduced to scale and shift the normalized activations,
allowing the model to learn the optimal activation distribution.
Let’s start with the normalization step. We first calculate the mean and variance for each feature in a mini-batch of m samples x_1, …, x_m. These are the formulas we can use for the mean and the variance:

μ_B = (1/m) Σ x_i

σ_B² = (1/m) Σ (x_i − μ_B)²
We then use the mean and variance to normalize the activations. This is the formula we can use, where ε (lowercase epsilon) is a small constant added for numerical stability:

x̂_i = (x_i − μ_B) / √(σ_B² + ε)
After we’re done with the normalization step, we move on to the scaling and shifting step. Using the learnable parameters γ and β, we scale and shift the normalized activations using this formula:

y_i = γ · x̂_i + β
To calculate the running mean and variance, we can use these two formulas, where α is the momentum factor that controls the update rate of the running statistics:

μ_running = α · μ_running + (1 − α) · μ_B

σ²_running = α · σ²_running + (1 − α) · σ_B²
The running mean and variance are stored as model parameters and used
for normalization during inference. The scaling and shifting parameters (γ
and β) learned during training are also used during inference.
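A minimal NumPy sketch tying these steps together (the momentum value and the dictionary layout for the running statistics are assumptions of this sketch):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, alpha=0.99, eps=1e-5, train=True):
    # x has shape (batch, features); statistics are computed per feature
    if train:
        mu = x.mean(axis=0)   # mini-batch mean
        var = x.var(axis=0)   # mini-batch variance
        # update running statistics with momentum alpha
        running["mean"] = alpha * running["mean"] + (1 - alpha) * mu
        running["var"] = alpha * running["var"] + (1 - alpha) * var
    else:
        # inference: use the stored running statistics instead
        mu, var = running["mean"], running["var"]
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift
```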
3.(b) What is the difference between first order and second order methods?
First-order methods use only gradient (first-derivative) information, while second-order methods also use curvature information from the Hessian (second derivatives) or an approximation of it. The main trade-offs are:
- **Convergence Speed:** Second-order methods can converge faster than first-order methods,
especially in scenarios where the objective function has complex curvature or is poorly conditioned. This
faster convergence is due to their ability to take into account curvature information beyond just the
gradient.
- **Robustness:** First-order methods, particularly SGD and its variants, are more robust to noisy
gradients and are widely used in deep learning for this reason. Second-order methods can be less robust
to noise and require more careful handling of numerical stability issues.
- **Implementation Complexity:** First-order methods are generally easier to implement and tune
compared to second-order methods, which involve additional considerations such as Hessian
approximation techniques, preconditioning, and regularization to ensure stability and efficiency.
- **Adaptability:** First-order methods like Adam and RMSprop incorporate adaptive learning rates and
momentum, which can enhance their performance in different optimization scenarios. Second-order
methods also have adaptations like limited-memory variants (e.g., L-BFGS) to improve efficiency and
scalability in large-scale optimization tasks.
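A toy illustration of the curvature point (the quadratic and the step size are assumptions of this sketch): on a poorly conditioned quadratic, gradient descent's step size is capped by the steepest direction, while a Newton step rescales every direction by the Hessian and converges at once.

```python
import numpy as np

# f(w) = 0.5 * w^T A w with condition number 100 (illustrative)
A = np.diag([1.0, 100.0])
grad = lambda w: A @ w

# first-order: gradient descent; lr must stay below 2/100 for stability
w_gd = np.array([1.0, 1.0])
for _ in range(100):
    w_gd = w_gd - 0.009 * grad(w_gd)

# second-order: one Newton step, solving H * step = grad with H = A
w_newton = np.array([1.0, 1.0])
w_newton = w_newton - np.linalg.solve(A, grad(w_newton))

print(w_gd)      # still ~0.4 from the optimum along the flat direction
print(w_newton)  # [0. 0.]: exact minimum in a single step
```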