DL Unit-3
We have learned about gradient descent and its various types. We looked into Stochastic gradient
descent, which is excellent when we have a huge dataset.
However, our dataset may contain both sparse and dense features.
Sparse features have very few non-zero values and require a
higher learning rate. Dense features chiefly have non-zero values,
requiring a lower learning rate. Stochastic gradient descent, however, applies the same learning rate to every feature. To tackle this problem, we use AdaGrad, which uses a different learning rate for each feature. Let us see how AdaGrad actually works.
Working of AdaGrad
AdaGrad uses the gradients of all the previous steps to calculate the learning rate of a particular feature at each step:

η_t = η / √(V_t + ε)

Here, η is the initial learning rate. ε is a small positive value; it is added in the denominator to avoid division by zero if V_t becomes zero. V_t is given by:

V_t = g_1² + g_2² + … + g_t²
The gradients of all the previous steps are used to calculate V_t. If a dense feature is updated frequently, the accumulated sum of squared gradients will be large, V_t will be high, and the learning rate will be lowered. A sparse feature, in contrast, is updated less often, so its learning rate stays higher than that of a dense feature. Each feature thus has its own learning rate at each iteration.
The equation below shows the update rule for a weight w at the (t+1)-th iteration, where η is the initial learning rate and g_t is the gradient at step t:

w_{t+1} = w_t − (η / √(V_t + ε)) · g_t
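A minimal NumPy sketch of this per-feature rule (the toy gradient pattern and hyperparameter values are illustrative assumptions, not from the original):

```python
import numpy as np

def adagrad_update(w, grad, v, lr=0.1, eps=1e-8):
    # V_t = V_{t-1} + g_t^2, accumulated separately for each feature
    v += grad ** 2
    # w_{t+1} = w_t - (lr / sqrt(V_t + eps)) * g_t
    w -= lr / np.sqrt(v + eps) * grad
    return w, v

# toy run: feature 0 gets dense gradients, feature 1 sparse ones
w, v = np.zeros(2), np.zeros(2)
for step in range(100):
    grad = np.array([1.0, 1.0 if step % 10 == 0 else 0.0])
    w, v = adagrad_update(w, grad, v)
print(v)  # V_t is much larger for the dense feature, so its steps shrink faster
```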
1.(b) Describe RMSProp
RMSProp, as described, has the effect of scaling the magnitudes of the gradient descent steps to be closer to each other in each dimension; it is more of a normalization approach.
By normalizing:
The higher-magnitude gradient (of B) becomes relatively smaller (better).
The lower-magnitude gradient (of W) becomes relatively larger (better).
RMSProp is an unpublished adaptive learning rate optimizer proposed by Geoff Hinton. The motivation is that
the magnitude of gradients can differ for different weights, and can change during learning, making it hard to
choose a single global learning rate. RMSProp tackles this by keeping a moving average of the squared gradient
and adjusting the weight updates by this magnitude. The gradient updates are performed as:
E[g²]_t = γ · E[g²]_{t−1} + (1 − γ) · g_t²

w_{t+1} = w_t − (η / √(E[g²]_t + ε)) · g_t

where γ is the decay rate of the moving average and ε is a small constant for numerical stability.
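A minimal NumPy sketch of these two equations (the hyperparameter values are common defaults, assumed here):

```python
import numpy as np

def rmsprop_update(w, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    # E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2
    # normalize the step in each dimension by the root mean square
    w -= lr / np.sqrt(avg_sq + eps) * grad
    return w, avg_sq
```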
2.(a) Explain Dropout
The term “dropout” refers to dropping out units (both hidden and visible) in a neural
network.
Simply put, dropout refers to ignoring units (i.e., neurons) during the training phase; the set of ignored neurons is chosen at random. By "ignoring", I mean these units are not considered during a particular forward or backward pass.
More technically, at each training stage, individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.
Training Phase:
For each hidden layer, for each training sample, for each iteration, ignore (zero out) a random fraction, 1-p, of nodes (and the corresponding activations).
Testing Phase:
Use all activations, but reduce them by a factor p (to account for the missing activations
during training).
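A minimal NumPy sketch of these two phases (standard, non-inverted dropout, as described above; p is the keep probability):

```python
import numpy as np

def dropout_forward(a, p=0.8, train=True):
    if train:
        # keep each unit with probability p; dropped units output zero
        mask = np.random.rand(*a.shape) < p
        return a * mask
    # testing: use all activations, scaled by p to account for training-time drops
    return a * p
```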
Experiment in Keras
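The original code is not included in these notes; a minimal sketch of such an experiment (the dataset, architecture, and the dropout fractions swept are illustrative assumptions) might look like:

```python
from tensorflow import keras
from tensorflow.keras import layers

# MNIST as a stand-in dataset (an assumption; the original data is not stated)
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

def build_model(rate):
    # small fully connected net; layer sizes are illustrative
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(rate),  # rate = fraction of units dropped
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# sweep dropout fractions and compare validation accuracy and loss
for rate in [0.0, 0.1, 0.2, 0.3]:
    history = build_model(rate).fit(x_train, y_train, epochs=5,
                                    validation_split=0.2, verbose=0)
    print(rate, max(history.history["val_accuracy"]))
```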
From the resulting graphs (not reproduced here) we can conclude that, as dropout increases, there is some increase in validation accuracy and a decrease in loss initially, before the trend starts to go down.
There could be two reasons for the trend going down beyond a dropout fraction of 0.2:
1. 0.2 is the actual minimum for this dataset, network, and the parameter settings used
Batch Normalization
Batch Normalization normalizes a layer's activations using statistics computed over each mini-batch. The normalized activations are then scaled and shifted using learnable parameters, allowing the model to adapt to the optimal activation distribution.
Batch Normalization is typically applied after the linear transformation of a layer (e.g.,
after the matrix multiplication in a fully connected layer or after the convolution operation
in a convolutional layer) and before the non-linear activation function (e.g., ReLU).
1. Mini-batch statistics: The mean and variance of the activations are calculated for each feature within the mini-batch.
2. Normalization: The activations are normalized by subtracting the mini-batch mean and dividing by the mini-batch
standard deviation.
3. Scaling and shifting: Learnable parameters (γ and β) are introduced to scale and shift the normalized activations,
allowing the model to learn the optimal activation distribution.
Let’s start with the normalization step. We first calculate the mean and variance for each feature in a mini-batch of m samples x_1, …, x_m. These are the formulas we can use for the mean and the variance:

μ_B = (1/m) Σ x_i

σ_B² = (1/m) Σ (x_i − μ_B)²
We then use the mean and variance to normalize the activations. This is the formula we can use, where ε (lowercase epsilon) is a small constant added for numerical stability:

x̂_i = (x_i − μ_B) / √(σ_B² + ε)
After we’re done with the normalization step, we move on to the scaling and shifting step. Using the learnable parameters γ and β, we scale and shift the normalized activations using this formula:

y_i = γ · x̂_i + β
To calculate the running mean and variance, we can use these two formulas, where α is the momentum factor that controls the update rate of the running statistics:

μ_running = α · μ_running + (1 − α) · μ_B

σ²_running = α · σ²_running + (1 − α) · σ_B²
The running mean and variance are stored as model parameters and used
for normalization during inference. The scaling and shifting parameters (γ
and β) learned during training are also used during inference.
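A minimal NumPy sketch tying these steps together (the momentum value and the dictionary layout for the running statistics are assumptions of this sketch):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, alpha=0.99, eps=1e-5, train=True):
    # x has shape (batch, features); statistics are computed per feature
    if train:
        mu = x.mean(axis=0)   # mini-batch mean
        var = x.var(axis=0)   # mini-batch variance
        # update running statistics with momentum alpha
        running["mean"] = alpha * running["mean"] + (1 - alpha) * mu
        running["var"] = alpha * running["var"] + (1 - alpha) * var
    else:
        # inference: use the stored running statistics instead
        mu, var = running["mean"], running["var"]
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift
```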
3.(b) What is the difference between first order and second order methods?
First-order methods use only gradient (first-derivative) information, while second-order methods also use curvature information from the Hessian (second derivatives) or an approximation of it. The main trade-offs are:
- **Convergence Speed:** Second-order methods can converge faster than first-order methods,
especially in scenarios where the objective function has complex curvature or is poorly conditioned. This
faster convergence is due to their ability to take into account curvature information beyond just the
gradient.
- **Robustness:** First-order methods, particularly SGD and its variants, are more robust to noisy
gradients and are widely used in deep learning for this reason. Second-order methods can be less robust
to noise and require more careful handling of numerical stability issues.
- **Implementation Complexity:** First-order methods are generally easier to implement and tune
compared to second-order methods, which involve additional considerations such as Hessian
approximation techniques, preconditioning, and regularization to ensure stability and efficiency.
- **Adaptability:** First-order methods like Adam and RMSprop incorporate adaptive learning rates and
momentum, which can enhance their performance in different optimization scenarios. Second-order
methods also have adaptations like limited-memory variants (e.g., L-BFGS) to improve efficiency and
scalability in large-scale optimization tasks.
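A toy illustration of the curvature point (the quadratic and the step size are assumptions of this sketch): on a poorly conditioned quadratic, gradient descent's step size is capped by the steepest direction, while a Newton step rescales every direction by the Hessian and converges at once.

```python
import numpy as np

# f(w) = 0.5 * w^T A w with condition number 100 (illustrative)
A = np.diag([1.0, 100.0])
grad = lambda w: A @ w

# first-order: gradient descent; lr must stay below 2/100 for stability
w_gd = np.array([1.0, 1.0])
for _ in range(100):
    w_gd = w_gd - 0.009 * grad(w_gd)

# second-order: one Newton step, solving H * step = grad with H = A
w_newton = np.array([1.0, 1.0])
w_newton = w_newton - np.linalg.solve(A, grad(w_newton))

print(w_gd)      # still ~0.4 from the optimum along the flat direction
print(w_newton)  # [0. 0.]: exact minimum in a single step
```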