DL Mod2
• When the first neuron in the network is stimulated, the input signal is
processed, and if it exceeds a particular threshold, the neuron is activated
and passes the signal on to the neurons to which it is connected.
• These neurons in turn may be activated and pass the signal on through the
rest of the network.
• Over time, the connections between the neurons are strengthened by
frequent use as we learn how to respond effectively.
• Machine learning is concerned with predicting a label based on some
features of a particular observation. In simple terms, a machine learning
model is a function that calculates y (the label) from x (the features): f(x)=y
• Because of the layered architecture of the network, this kind of model is
sometimes referred to as a multilayer perceptron.
• Additionally, notice that all neurons in the input and hidden layers are
connected to all neurons in the subsequent layers - this is an example of a
fully connected network.
• While creating a model like this, we must define an input layer that supports
the number of features our model will process, and an output layer that
reflects the number of outputs we expect it to produce.
• We can decide how many hidden layers we want to include and how many
neurons are in each of them;
• but we have no control over the input and output values for these layers -
these are determined by the model training process.
• If we initialize all the weights to 0, the derivative of the loss function is
the same for every weight in W[l], so all weights in a layer keep the same
value in subsequent iterations.
• This makes the hidden units symmetric, and the symmetry persists through all
n iterations. A network initialized with zero weights is therefore no better
than a linear model.
II. Random Initialization (Initialized weights randomly)
• 1) Xavier/Glorot Initialization :
• Xavier Initialization is an initialization heuristic that keeps the
variance of the input to a layer the same as that of the output of the layer.
This keeps the variance roughly constant throughout the network.
• It works well with the sigmoid and tanh activation functions.
• i)Xavier Normal :
• Normal Distribution with Mean=0
Wij~N(0,std) where std=sqrt(2/(fan_in + fan_out))
• Here N is a Normal Distribution.
• ii)Xavier Uniform :
• Wij ~ D [-sqrt(6)/sqrt(fan_in+fan_out),sqrt(6)/sqrt(fan_in + fan_out)]
• Where D is a Uniform Distribution
• 2)He Init[Kaiming Initialization] :
• This weight initialization also has two variations. It works pretty well for
ReLU and LeakyReLU activation function.
• i)He Normal :
• Normal Distribution with Mean=0
Wij ~ N(mean,std) , mean=0 , std=sqrt(2/fan_in)
• Where N is a Normal Distribution
• ii)He Uniform :
• Wij ~ D[-sqrt(6/fan_in),sqrt(6/fan_in)]
• Where D is a Uniform Distribution
• Benefits of using these heuristics:
• All these heuristics serve as good starting points for weight initialization and
they reduce the chances of exploding or vanishing gradients.
• With these heuristics, gradients are less likely to vanish or explode
quickly, because the weights are neither much larger than 1 nor much smaller
than 1.
• They help to avoid slow convergence and reduce oscillation around the
minima.
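The Xavier and He formulas above can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the function names, the `rng` argument, and the nested-list layout (`fan_in` rows by `fan_out` columns) are assumptions made for the example.

```python
import math
import random


def xavier_normal(fan_in, fan_out, rng):
    # W_ij ~ N(0, std) with std = sqrt(2 / (fan_in + fan_out))
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def xavier_uniform(fan_in, fan_out, rng):
    # W_ij ~ U[-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def he_normal(fan_in, fan_out, rng):
    # W_ij ~ N(0, std) with std = sqrt(2 / fan_in)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def he_uniform(fan_in, fan_out, rng):
    # W_ij ~ U[-limit, limit] with limit = sqrt(6 / fan_in)
    limit = math.sqrt(6.0 / fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)                           # fixed seed so the sketch is repeatable
w_sigmoid_layer = xavier_uniform(256, 128, rng)  # e.g. before a sigmoid layer
w_relu_layer = he_normal(256, 128, rng)          # e.g. before a ReLU layer
```

Both families differ only in the scale they derive from `fan_in` and `fan_out`; the choice between them follows the activation function, as described above.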
Gradient Descent
• Gradient Descent is one of the most popular optimization techniques.
• It iteratively tweaks the parameters to move a given function towards a
local minimum; for a convex function, that local minimum is also the global
minimum.
• Gradient Descent is an optimization algorithm for finding a local minimum
of a differentiable function.
• We start by defining initial parameter values, and from there gradient
descent uses calculus to iteratively adjust the values so that they minimize
the given cost function.
• The update rule θ = θ − η · ∇θ J(θ) uses the gradient of the cost function
J(θ) w.r.t. the parameters/weights θ, computed for the entire training
dataset; η is the learning rate.
• "A gradient measures how much the output of a function changes if you
change the inputs a little bit."
• The learning rate determines how big a step gradient descent takes in the
direction of the local minimum. It tells us how fast or slowly we move
towards the optimal weights.
• When we initialize the learning rate, we set an appropriate value that is
neither too low nor too high.
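The effect of the learning rate can be seen on a one-dimensional example. The sketch below is illustrative only: the cost function J(θ) = (θ − 3)² and the three learning rates are assumptions chosen to show a well-chosen, a too-small, and a too-large step size.

```python
def gradient_descent(grad, theta0, learning_rate, n_steps):
    """Repeatedly step against the gradient: theta <- theta - learning_rate * grad(theta)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - learning_rate * grad(theta)
    return theta


def grad_J(theta):
    # Gradient of the assumed cost J(theta) = (theta - 3)^2, whose minimum is at theta = 3
    return 2.0 * (theta - 3.0)


theta_good = gradient_descent(grad_J, theta0=0.0, learning_rate=0.1, n_steps=100)
theta_slow = gradient_descent(grad_J, theta0=0.0, learning_rate=0.001, n_steps=100)
theta_diverged = gradient_descent(grad_J, theta0=0.0, learning_rate=1.1, n_steps=100)
```

With the moderate rate the result is very close to 3; with the tiny rate it has barely left the starting point after the same number of steps; with the large rate every step overshoots the minimum by more than the previous one and the value blows up.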
• Advantages of Gradient Descent
• Easy Computation
• Easy to implement
• Easy to understand
• Disadvantages of Gradient Descent
• May get trapped at local minima
• Weights are changed after calculating the gradient on the whole dataset,
so if the dataset is too large then it may take years to converge to the
minima
• Requires large memory to calculate gradient for whole dataset
• 3 Types of Gradient Descent
• Batch Gradient Descent
• Stochastic Gradient Descent
• Mini Batch Gradient Descent
• Batch Gradient Descent
• In batch gradient descent we use the entire dataset to calculate the
gradient of the cost function for each epoch.
• Each update therefore needs a full pass over the data, which is why
convergence is slow in batch gradient descent.
• SGD - Stochastic Gradient Descent
• The SGD algorithm is an extension of Gradient Descent that overcomes
some of the disadvantages of the gradient descent algorithm.
• In SGD the derivative is computed using one observation at a time. So if
the dataset contains 100 observations, it updates the model weights and bias
100 times in 1 epoch.
• SGD performs a parameter update for each training example x(i)
and label y(i): θ = θ − η · ∇θ J(θ; x(i); y(i))
• Advantages of SGD
• Memory requirement is less compared to Gradient Descent algorithm.
• Disadvantages of SGD
• May get stuck at local minima
• Time taken by 1 epoch is large compared to Gradient Descent
• Advantages of Mini Batch Gradient Descent
• Less time taken to converge the model
• Requires medium amount of memory
• Frequently updates the model parameters and also has less variance.
• Disadvantages of Mini Batch Gradient Descent
• If the learning rate is too small then convergence rate will be slow.
• It doesn't guarantee good convergence
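The three variants differ only in how many observations feed each update. The sketch below makes that explicit with a single `batch_size` parameter. It is a minimal illustration, assuming a one-feature linear model y ≈ w·x + b, a mean-squared-error cost, and noiseless toy data; none of these come from the notes.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.1, epochs=500, seed=0):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the observations in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the MSE computed on this batch only
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [i / 10 for i in range(20)]       # toy inputs 0.0 .. 1.9
ys = [2.0 * x + 1.0 for x in xs]       # toy targets from y = 2x + 1

w_batch, b_batch = minibatch_gd(xs, ys, batch_size=len(xs))  # batch GD: 1 update per epoch
w_sgd, b_sgd = minibatch_gd(xs, ys, batch_size=1)            # SGD: 20 updates per epoch
w_mini, b_mini = minibatch_gd(xs, ys, batch_size=5)          # mini-batch: 4 updates per epoch
```

On this toy problem all three settings recover w ≈ 2 and b ≈ 1; what changes is the number of updates per epoch and how much data each update has to touch.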
Problems with Gradient Descent
• There are a few problems that can occur when using gradient descent:
• I. Local Minima:
• Gradient descent can get stuck in local minima, points that are not the global
minimum of the cost function but are still lower than the surrounding points.
This can occur when the cost function has multiple valleys, and the
algorithm gets stuck in one instead of reaching the global minimum.
• II. Saddle Points:
• A saddle point is a point in the cost function that is a minimum along one
dimension and a maximum along another.
• Gradient descent can get stuck at these points because the gradient there is
close to zero, even though the cost still decreases in one direction while it
increases in the other.
• III. Plateaus:
• A plateau is a region in the cost function where the gradients are very small
or close to zero. This can cause gradient descent to take a long time to
converge, or not to converge at all.
• IV. Oscillations:
• Oscillations occur when the learning rate is too high, causing the algorithm
to overshoot the minimum and oscillate back and forth.
• V. Slow convergence:
• Gradient descent can converge very slowly when the cost function is
complex or has many local minima. This means the algorithm may take a
long time to find the global minimum.
• VI. Vanishing and exploding gradients:
• Deep neural networks with many layers can suffer from vanishing or
exploding gradients. This occurs when the gradients become very small or
large, respectively, as they are backpropagated through the layers. This can
make it difficult for the algorithm to update the weights and biases.
• Momentum helps to,
• Escape local minima and saddle points
• Aids in faster convergence by reducing oscillations
• Smooths out weight updates for stability
• Reduces model complexity and prevents overfitting
• Can be used in combination with other optimization algorithms for improved
performance
• Momentum was introduced for reducing high variance in SGD.
• Instead of depending only on current gradient to update weight, gradient
descent with momentum replaces the current gradient with V (which stands
for velocity), the exponential moving average of current and past gradients.
• Momentum simulates the inertia of a moving object: the direction of the
previous update is retained to a certain extent during the update, while the
current gradient is used to fine-tune the final update direction.
• This method uses one more hyperparameter, known as momentum and
symbolized by ‘γ’.
• Momentum at time ‘t’ is computed using all previous updates, giving more
weight to recent updates than to older ones. This speeds up convergence.
• Essentially, when using momentum, we push a ball down a hill. The ball
accumulates momentum as it rolls downhill, becoming faster and faster on
the way.
• The same thing happens to our parameter updates: the momentum term
increases updates for dimensions whose gradients point in the same direction
and reduces updates for dimensions whose gradients change direction.
• As a result, we gain faster convergence and reduced oscillation.
• Advantages
• Converges faster than SGD
• All advantages of SGD
• Reduces the oscillations and high variance of the parameters
• Disadvantage
• One more extra variable is introduced that we need to compute for each
update
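The momentum update described above can be sketched as follows. This assumes the exponential-moving-average form of the velocity, V = γ·V + (1 − γ)·gradient, which matches the description in these notes; other texts write the velocity as γ·V + η·gradient instead. The bowl-shaped test function and all names are illustrative assumptions.

```python
def gd_momentum(grad, theta0, learning_rate=0.1, gamma=0.9, n_steps=300):
    """Gradient descent where the raw gradient is replaced by a velocity V."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)  # the extra state kept for every parameter
    for _ in range(n_steps):
        g = grad(theta)
        # V: exponential moving average of the current and past gradients
        velocity = [gamma * v + (1.0 - gamma) * gi for v, gi in zip(velocity, g)]
        theta = [t - learning_rate * v for t, v in zip(theta, velocity)]
    return theta


def grad_bowl(theta):
    # Gradient of the assumed cost 0.5 * (x^2 + 25 * y^2): much steeper in y than in x
    x, y = theta
    return [x, 25.0 * y]


theta_final = gd_momentum(grad_bowl, [10.0, 1.0])
```

The `velocity` list is the extra variable mentioned under the disadvantage: one additional value is stored and updated for every parameter.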
• In AdaGrad, α(t) denotes the learning rate at iteration t, η is a constant,
and ε is a small positive value that avoids division by 0.
• Advantages
• The learning rate changes adaptively, so no manual tuning is required
• One of the best algorithms for training on sparse data
• Disadvantages
• The learning rate is always decreasing, which leads to slow convergence
• Because the learning rate becomes very small, the model eventually stops
learning, and its accuracy is compromised.
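A minimal AdaGrad sketch shows why the learning rate can only shrink. The per-parameter rate is taken here as η / (sqrt(sum of squared gradients) + ε); the exact placement of ε varies between references, so treat that detail, the test function, and the names as assumptions.

```python
import math


def adagrad(grad, theta0, eta=0.5, eps=1e-8, n_steps=100):
    theta = list(theta0)
    grad_sq_sum = [0.0] * len(theta)  # running sum of squared gradients per parameter
    first_param_rates = []            # effective learning rate of parameter 0, for inspection
    for _ in range(n_steps):
        g = grad(theta)
        grad_sq_sum = [s + gi * gi for s, gi in zip(grad_sq_sum, g)]
        # The sum never decreases, so each per-parameter rate never increases
        rates = [eta / (math.sqrt(s) + eps) for s in grad_sq_sum]
        theta = [t - r * gi for t, r, gi in zip(theta, rates, g)]
        first_param_rates.append(rates[0])
    return theta, first_param_rates


def grad_bowl(theta):
    # Gradient of the assumed cost 0.5 * (x^2 + 25 * y^2)
    x, y = theta
    return [x, 25.0 * y]


theta_final, rates = adagrad(grad_bowl, [10.0, 1.0])
```

Inspecting `rates` shows a sequence that never goes back up, which is the "always decreasing" behaviour listed above.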
• AdaDelta is an extension of AdaGrad. In AdaGrad the learning rate keeps
decaying and after some time approaches zero.
• AdaDelta was introduced to get rid of this learning rate decay problem.
• To deal with these problems, AdaDelta uses two state variables to store the
leaky average of the second moment gradient and a leaky average of the second
moment of change of parameters in the model.
• In simple terms, AdaDelta adapts the learning rate based on a moving window
of gradient updates, instead of accumulating all past gradients.
• Here an exponentially moving average is used rather than the sum of all the
gradients:
• E[g²](t)=γ.E[g²](t−1)+(1−γ).g²(t)
• We set γ to a similar value as the momentum term, around 0.9.
• Advantages
• Learning rate doesn't decay
• Disadvantages
• Computationally expensive
RMSProp - Root Mean Square Propagation
• RMSProp is a variant of AdaGrad, and is in fact an improvement on the
AdaGrad optimizer.
• Here the learning rate is scaled by an exponential moving average of the
squared gradients instead of their cumulative sum.
• RMSProp basically applies a momentum-style decaying average to AdaGrad's
squared-gradient accumulator.
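The moving average E[g²](t) = γ·E[g²](t−1) + (1−γ)·g²(t) can be dropped into the AdaGrad-style update in place of the running sum. The sketch below is a minimal RMSProp under that assumption; the values of η and ε, the test function, and the names are illustrative.

```python
import math


def rmsprop(grad, theta0, eta=0.01, gamma=0.9, eps=1e-8, n_steps=500):
    theta = list(theta0)
    avg_sq = [0.0] * len(theta)  # E[g^2]: moving average of squared gradients
    for _ in range(n_steps):
        g = grad(theta)
        avg_sq = [gamma * a + (1.0 - gamma) * gi * gi for a, gi in zip(avg_sq, g)]
        # Old gradients fade out of the average, so the step size can recover
        theta = [t - eta * gi / (math.sqrt(a) + eps)
                 for t, a, gi in zip(theta, avg_sq, g)]
    return theta


def grad_bowl(theta):
    # Gradient of the assumed cost 0.5 * (x^2 + 25 * y^2)
    x, y = theta
    return [x, 25.0 * y]


theta_final = rmsprop(grad_bowl, [1.0, 1.0])
```

Because each gradient is divided by the root of its own recent average, both coordinates move at a similar pace even though one direction is 25 times steeper.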
• Adam can be looked at as a combination of RMSprop and Stochastic
Gradient Descent with momentum.
• It uses the squared gradients to scale the learning rate like RMSprop, and
it takes advantage of momentum by using a moving average of the gradient.
• It also introduces two new hyper-parameters, beta1 and beta2, which
are usually kept around 0.9 and 0.999, but we can change them according to
our use case. The default value for the learning rate η is 0.001.
• Adam is an adaptive learning rate method, which means, it computes
individual learning rates for different parameters.
• Its name is derived from adaptive moment estimation, and the reason it’s
called that is because Adam uses estimations of first and second moments of
gradient to adapt the learning rate for each weight of the neural network.
• The n-th moment of a random variable is defined as the expected value of
that variable raised to the power n.
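Putting the two moment estimates together gives the following minimal Adam sketch. It includes the bias-correction step from the standard formulation, which these notes do not spell out, so treat that step, the test function, and the names as assumptions.

```python
import math


def adam(grad, theta0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=100):
    theta = list(theta0)
    m = [0.0] * len(theta)  # first moment: moving average of the gradients
    v = [0.0] * len(theta)  # second moment: moving average of the squared gradients
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction: m and v start at zero and would otherwise be too small early on
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [ti - eta * mh / (math.sqrt(vh) + eps)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


def grad_bowl(theta):
    # Gradient of the assumed cost 0.5 * (x^2 + 25 * y^2)
    x, y = theta
    return [x, 25.0 * y]


theta_after_one = adam(grad_bowl, [1.0, 1.0], n_steps=1)
theta_after_many = adam(grad_bowl, [1.0, 1.0], eta=0.01, n_steps=100)
```

On the very first step each parameter moves by almost exactly η regardless of how large its gradient is, which illustrates the per-parameter scaling of the learning rate.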
L1 regularization (LASSO regression) (Least Absolute Shrinkage and Selection
Operator) produces sparse weight matrices. A sparse matrix is one in which most
elements are zero; in this context it can also mean a matrix with several
close-to-zero values and a few larger values. If we find a model with neurons whose
weights are close to zero, we don’t need those neurons, because the model
effectively deactivates them, and we might not need a specific feature/input,
leading to a simpler model. For instance, if we have 50
coefficients but only 10 are non-zero, the other 40 are irrelevant to make our
predictions. This is not only interesting from the efficiency point of view but also
from the economic point of view: gathering data and extracting its features might
be a very expensive task (in terms of time and money). Reducing this will benefit
Due to the absolute value, L1 regularization introduces a non-differentiable
term, but despite that, there are methods to minimize it.
L2 regularization (Ridge regression) on the other hand leads to a balanced
minimization of the weights. Since L2 uses squares, it emphasizes the errors, and it
can be a problem when there are outliers in the data. Unlike L1, L2 has an
analytical solution which makes it computationally efficient.
Both regularizations have a λ parameter which is directly proportional to the
penalty: the larger λ, the stronger the penalty on complex models and the more
likely it is that the model will avoid them. Likewise, if λ is zero, regularization is
deactivated.
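The two penalties and the role of λ can be written down directly. The sketch below is illustrative: the function names are hypothetical, and the soft-thresholding step is shown only as one commonly used way of handling the non-differentiable L1 term, not as the method these notes have in mind.

```python
def l1_penalty(weights, lam):
    # lambda * sum of absolute values
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    # lambda * sum of squares
    return lam * sum(w * w for w in weights)


def soft_threshold(w, threshold):
    """Shrink w towards zero by `threshold`; small weights become exactly zero."""
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0


weights = [1.0, -2.0, 0.5]
l1 = l1_penalty(weights, lam=0.1)        # 0.1 * (1 + 2 + 0.5)
l2 = l2_penalty(weights, lam=0.1)        # 0.1 * (1 + 4 + 0.25)
no_penalty = l2_penalty(weights, lam=0.0)
sparse = [soft_threshold(w, 0.1) for w in [0.05, -0.3, 1.0]]
```

In `sparse` the smallest weight is set to exactly zero while the larger ones are only shrunk, which is the sparsity effect described for L1.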
• Early stopping is an optimization technique used to reduce overfitting
without compromising on model accuracy.
• The main idea behind early stopping is to stop training before a model starts
to overfit.
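One common way to decide when the model "starts to overfit" is to watch the loss on a held-out validation set and stop once it has not improved for a few epochs. The sketch below assumes that rule; the `patience` parameter and the function name are illustrative and are not defined in these notes.

```python
def early_stopping(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop here; keep the weights from best_epoch
    return len(val_losses) - 1, best_epoch


# Validation loss falls, then rises again as the model begins to overfit
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1]
stop_epoch, best_epoch = early_stopping(losses, patience=2)
```

Here training would stop at epoch 4 and the weights from epoch 2 would be kept, since that is where the validation loss was lowest.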
• Data augmentation is a set of techniques to artificially increase the size
of the dataset by modifying copies of existing data or synthetically
generating new data from the existing dataset.
• Data augmentation is a process of artificially increasing the amount of data
by generating new data points from existing data.
• This includes adding minor alterations to data or using machine learning
models to generate new data points to amplify the dataset.
• Synthetic data: When data is generated artificially without using real-world
images. Synthetic data are often produced by Generative Adversarial
Networks (GANs).
• Augmented data: Derived from original images with some sort of minor
geometric transformations (such as flipping, translation, rotation, or the
addition of noise) in order to increase the diversity of the training set.
• Data augmentation is the process of transforming images to create new ones,
for training machine learning models.
• This is an important step when building datasets because modern machine
learning models are very powerful; if they're given datasets that are too
small, these models can start to ‘overfit’.
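The geometric transformations and noise mentioned above can be illustrated on a tiny image stored as a list of pixel rows. This is a minimal sketch: real pipelines use an image library, and the function names and the 2×2 example image are assumptions.

```python
import random


def horizontal_flip(image):
    # Mirror every row left to right
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    # Reverse the row order, then transpose
    return [list(row) for row in zip(*image[::-1])]


def add_gaussian_noise(image, sigma, rng):
    # Add independent zero-mean noise to every pixel
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]


rng = random.Random(0)
original = [[1, 2],
            [3, 4]]
augmented = [
    horizontal_flip(original),
    rotate_90_clockwise(original),
    add_gaussian_noise(original, sigma=0.1, rng=rng),
]
```

Each entry of `augmented` is a new training example derived from the same original image, so the dataset grows without collecting new data.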
• Ensemble learning is a machine learning paradigm where multiple models
(often called “weak learners”) are trained to solve the same problem and
combined to get better results. The main hypothesis is that when weak
models are correctly combined we can obtain more accurate and/or robust
models.
• Weak Learners: A ‘weak learner’ is any ML algorithm (for
regression/classification) that provides an accuracy slightly better than
random guessing.
• In ensemble learning theory, we call weak learners (or base models)
models that can be used as building blocks for designing more complex
models by combining several of them.
• Most of the time, these basic models do not perform well by themselves,
either because they have a high bias or because they have too much variance
to be robust.
• Then, the idea of ensemble methods is to try reducing bias and/or variance
of such weak learners by combining several of them together to create
a strong learner (or ensemble model) that achieves better performances.
• Bagging aims to decrease variance, boosting aims to decrease bias, and
stacking aims to improve prediction accuracy.
• Bagging stands for Bootstrap Aggregation.
• Bootstrapping is a technique of sampling different sets of data from a given
training set by using replacement. After bootstrapping the training dataset,
we train the model on all the different sets and aggregate the result. This
technique is known as Bootstrap Aggregation or Bagging.
• Bagging is the type of ensemble technique in which a single training
algorithm is used on different subsets of the training data where the subset
sampling is done with replacement (bootstrap). Once the algorithm is trained
on all the subsets, then bagging predicts by aggregating all the predictions
made by the algorithm on different subsets.
• For aggregating the outputs of base learners, bagging uses majority voting
(most frequent prediction among all predictions) for
classification and averaging (mean of all the predictions) for regression.
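Bootstrap sampling and the two aggregation rules can be sketched as follows. The base learner used here, a nearest-class-mean classifier on a single feature, is a hypothetical stand-in chosen to keep the example short; the toy data and function names are likewise assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    # Draw len(data) items with replacement
    return [rng.choice(data) for _ in data]


def train_nearest_mean(sample):
    """Hypothetical weak learner: predict the class whose mean feature value is closest."""
    totals = {}
    for x, label in sample:
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + x, count + 1)
    means = {label: total / count for label, (total, count) in totals.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def bagging_classify(models, x):
    # Majority voting: the most frequent prediction wins
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def bagging_regress(models, x):
    # Averaging: mean of all predictions (for models that return numbers)
    return sum(model(x) for model in models) / len(models)


rng = random.Random(0)
data = [(0.1, "a"), (0.3, "a"), (0.2, "a"), (0.4, "a"),
        (1.9, "b"), (2.1, "b"), (2.0, "b"), (1.8, "b")]
models = [train_nearest_mean(bootstrap_sample(data, rng)) for _ in range(25)]
prediction = bagging_classify(models, 0.0)
```

Every model sees a slightly different resampled dataset, and the final prediction comes from the vote across all 25 of them.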
• Advantages of a Bagging Model:
• 1. Bagging significantly decreases the variance without increasing bias.
• 2. Bagging methods work so well because of diversity in the training data
since the sampling is done by bootstrapping.
• 3. Also, if the training set is very large, it can save computational time by
training each model on a relatively smaller data set while still increasing
the accuracy of the model.
• 4. Works well with small datasets as well.
• The term ‘Boosting’ refers to a family of algorithms which convert weak
learners into strong learners. Boosting is an ensemble method for improving
the model predictions of any given learning algorithm. The idea of boosting
is to train weak learners sequentially, each trying to correct its
predecessor. Because each weak learner corrects the errors of the one before
it, the combined sequence becomes a strong learner.
• Pros
• Computational scalability
• Handles missing values
• Robust to outliers
• Does not require feature scaling
• Can deal with irrelevant inputs
• Interpretable (if small)
• Cons
• Inability to extract a linear combination of features
• High variance leading to a small computational power
• Stacking is an ensemble learning method that combines multiple machine
learning algorithms via meta-learning: the base-level algorithms are trained
on the complete training dataset, and the meta-model is then trained using
the outputs of all the base-level models as features.
Advantages of a Stacked Generalization Model:
• Stacking improves the model prediction accuracy.
• The term “dropout” refers to dropping out nodes (in the input and hidden
layers) of a neural network.
• All the forward and backwards connections with a dropped node are
temporarily removed, thus creating a new network architecture out of the
parent network.
The nodes are dropped by a dropout probability of p.
• Let’s try to understand with a given input x: {1, 2, 3, 4, 5} to the fully
connected layer. We have a dropout layer with probability p = 0.2 (or keep
probability = 0.8). During the forward propagation (training) from the input
x, 20% of the nodes would be dropped on average, i.e. x could become
{1, 0, 3, 4, 5} or {1, 2, 0, 4, 5} and so on. The same applies to the hidden
layers.
• For instance, if a hidden layer has 1000 neurons (nodes) and dropout is
applied with drop probability = 0.5, then about 500 neurons would be randomly
dropped in every iteration (batch).
• By using dropout, in every iteration we train a smaller, randomly thinned
version of the network, which acts as a form of regularization.
• Dropout helps in shrinking the squared norm of the weights, and this tends
to reduce overfitting.
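The forward pass of a dropout layer can be sketched as below. The example uses the "inverted dropout" variant, in which the kept activations are scaled by 1 / (keep probability) so that their expected value is unchanged; that scaling is not described in these notes and is an assumption, as are the names.

```python
import random


def dropout(values, p, rng, training=True):
    """Zero each value with probability p during training; pass through at inference."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    # Keep each node with probability `keep` and rescale it; otherwise drop it to 0
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
x_train = dropout(x, p=0.2, rng=rng)                  # some entries may be zeroed
x_eval = dropout(x, p=0.2, rng=rng, training=False)   # unchanged at inference
```

A fresh random mask is drawn on every call, so a different thinned network is trained in each iteration.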
• DropConnect works similarly, except that we disable individual weights
(i.e., set them to zero), instead of nodes, so a node can remain partially
active.
• Dropconnect works by randomly setting some of these weights to zero
during training. This has the effect of “dropping out” some of the
connections between neurons.
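For comparison with dropout, the sketch below masks individual weights rather than whole nodes in a small fully connected layer. The layer layout (`weights[i][j]` connects input i to output j), the absence of any rescaling, and the names are assumptions made for illustration.

```python
import random


def dropconnect_forward(x, weights, p, rng):
    """Compute y_j = sum_i x_i * W_ij, dropping each connection with probability p."""
    n_out = len(weights[0])
    outputs = [0.0] * n_out
    for i, xi in enumerate(x):
        for j in range(n_out):
            if rng.random() >= p:  # this connection is kept with probability 1 - p
                outputs[j] += xi * weights[i][j]
    return outputs


rng = random.Random(0)
x = [1.0, 2.0]
weights = [[1.0, 2.0],
           [3.0, 4.0]]
y_masked = dropconnect_forward(x, weights, p=0.5, rng=rng)
y_full = dropconnect_forward(x, weights, p=0.0, rng=rng)
```

An output node still receives input through whichever of its connections survive, which is what "partially active" refers to.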
• One of the most common problems for data science professionals is avoiding
overfitting.
• The solution to such a problem is regularization.
• Regularization techniques help to improve a model and allow it to
converge faster. We have several regularization tools at our disposal, some
of them are early stopping, dropout, weight initialization techniques, and
batch normalization. Regularization helps in preventing over-fitting of the
model, and the learning process becomes more efficient.
• Normalization is a data pre-processing tool used to bring the numerical data
to a common scale without distorting its shape.
• Batch normalization is a process that makes neural networks faster and more
stable by adding extra layers to a deep neural network. The new layer
performs standardizing and normalizing operations on the input of a
layer coming from a previous layer.
• What Is Normalization?
• Normalization is a data preprocessing technique used to adjust the values of
features in a dataset to a common scale. This is done to facilitate data
analysis and modeling, and to reduce the impact of different scales on the
accuracy of machine learning models.
• Normalization is a scaling technique in which values are shifted and rescaled
so that they end up ranging between 0 and 1. It is also known as Min-Max
scaling.
• Here’s the formula for normalization:
X’ = (X − Xmin) / (Xmax − Xmin)
• Here, Xmax and Xmin are the maximum and the minimum values of the
feature, respectively.
• When the value of X is the minimum value in the column, the numerator
will be 0, and hence X’ is 0
• On the other hand, when the value of X is the maximum value in the
column, the numerator is equal to the denominator, and thus the value of X’
is 1
• If the value of X is between the minimum and the maximum value, then the
value of X’ is between 0 and 1
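The three cases above can be checked with a short function. This is a minimal sketch; the function name and the handling of a constant column (where Xmax equals Xmin and the formula would divide by zero) are assumptions.

```python
def min_max_normalize(values):
    """Rescale values to the range [0, 1] using X' = (X - Xmin) / (Xmax - Xmin)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumed convention: a constant column maps to all zeros
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]


scaled = min_max_normalize([10, 20, 30])
```

For the column 10, 20, 30 the minimum maps to 0, the maximum to 1, and the middle value to 0.5.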
• A typical neural network is trained using a collected set of input data
called batch. Similarly, the normalizing process in batch normalization takes
place in batches, not as a single input.
• It is a two-step process. First, the input is normalized, and later rescaling and
offsetting is performed
• Normalization of the Input
• Normalization is the process of transforming the data to have a mean of zero
and a standard deviation of one. In this step we have our batch input from
layer h. First, we need to calculate the mean of this hidden activation:
μ = (1/m) · Σ h(i)
• Here, m is the number of neurons at layer h.
• Once we have the mean, the next step is to calculate the standard
deviation of the hidden activations:
σ = sqrt((1/m) · Σ (h(i) − μ)²)
• Now that we have the mean and the standard deviation ready, we normalize
the hidden activations using these values. For this, we subtract the mean
from each input and divide the result by the sum of the standard
deviation and the smoothing term (ε):
h(i)norm = (h(i) − μ) / (σ + ε)
• The smoothing term(ε) assures numerical stability within the operation by
stopping a division by a zero value.
• Rescaling and Offsetting
• In the final operation, the re-scaling and offsetting of the input take place.
Here two components of the BN algorithm come into the picture, γ(gamma)
and β (beta). These parameters are used for re-scaling (γ) and shifting(β) of
the vector containing values from the previous operations
• These two are learnable parameters: during training, the neural network
learns the optimal values of γ and β, which enables accurate normalization
of each batch.
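The two steps, normalization followed by rescaling and offsetting, can be sketched for one batch as follows. The sketch assumes the statistics are computed per feature across the samples in the batch, which is how batch normalization is usually implemented, and it divides by (σ + ε) as described above. The function name and the fixed γ and β values are illustrative; in a real network γ and β are learned.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """batch: list of samples, each a list of feature activations."""
    m = len(batch)                 # number of samples in the batch (assumption)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(m)]
    for j in range(n_features):
        column = [sample[j] for sample in batch]
        mean = sum(column) / m
        std = math.sqrt(sum((h - mean) ** 2 for h in column) / m)
        for i, h in enumerate(column):
            h_norm = (h - mean) / (std + eps)         # step 1: normalize
            out[i][j] = gamma[j] * h_norm + beta[j]   # step 2: rescale and offset
    return out


batch = [[1.0, 10.0],
         [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
rescaled = batch_norm(batch, gamma=[2.0, 1.0], beta=[5.0, 0.0])
```

With γ = 1 and β = 0 both features end up close to −1 and +1 despite their different original scales; other values of γ and β stretch and shift that normalized output.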
• Speed up training