DL Mod2

MODULE 2
TRAINING DEEP MODELS: INTRODUCTION
Deep learning is an advanced form of machine learning that tries to emulate
the way the human brain learns.
In the brain, we have nerve cells called neurons, which are connected to one
another by nerve extensions that pass electrochemical signals through the network.
• When the first neuron in the network is stimulated, the input signal is
processed, and if it exceeds a particular threshold, the neuron is activated
and passes the signal on to the neurons to which it is connected.
• These neurons in turn may be activated and pass the signal on through the
rest of the network.
• Over time, the connections between the neurons are strengthened by
frequent use as we learn how to respond effectively.
• Machine learning is concerned with predicting a label based on some
features of a particular observation. In simple terms, a machine learning
model is a function that calculates y (the label) from x (the features): f(x)=y
• Because of the layered architecture of the network, this kind of model is
sometimes referred to as a multilayer perceptron.
• Additionally, notice that all neurons in the input and hidden layers are
connected to all neurons in the subsequent layers - this is an example of a
fully connected network.
• While creating a model like this, we must define an input layer that supports
the number of features our model will process, and an output layer that
reflects the number of outputs we expect it to produce.
• We can decide how many hidden layers we want to include and how many
neurons are in each of them, but we have no control over the input and
output values for these layers; these are determined by the model training
process.
Training a deep neural network

• The training process for a deep neural network consists of multiple
iterations, called epochs.
• For the first epoch, we start by assigning random initial values to the
weights (w) and biases (b).
• Then the process is as follows:
• Features for data observations with known label values are submitted
to the input layer. Generally, these observations are grouped into
batches (often referred to as mini-batches).
• The neurons then apply their function, and if activated, pass the result
onto the next layer until the output layer produces a prediction.
• The prediction is compared to the actual known value, and the amount
of difference between the predicted and true values (which we call the
loss) is calculated.
• Based on the results, revised values for the weights and bias values are
calculated to reduce the loss, and these adjustments are
backpropagated to the neurons in the network layers.
• The next epoch repeats the batch training forward pass with the
revised weight and bias values, hopefully improving the accuracy of
the model (by reducing the loss).
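The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "network" is a single linear neuron, the toy dataset (y = 2x + 1), the learning rate, the batch size and the helper `mean_squared_loss` are all made up for the example and do not come from these notes.

```python
import random

random.seed(0)

# Toy dataset (an assumption for illustration): labels follow y = 2x + 1.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(20)]

# Start from random weight (w) and bias (b) values.
w = random.uniform(-0.5, 0.5)
b = random.uniform(-0.5, 0.5)
learning_rate = 0.1
batch_size = 5


def mean_squared_loss(w, b, samples):
    """Average squared difference between predictions and true labels."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)


initial_loss = mean_squared_loss(w, b, data)

for epoch in range(500):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Forward pass: prediction error for every observation in the batch.
        errors = [(w * x + b) - y for x, y in batch]
        # Backward pass: gradients of the batch loss w.r.t. w and b.
        grad_w = 2.0 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        grad_b = 2.0 * sum(errors) / len(batch)
        # Update step: move the parameters against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mean_squared_loss(w, b, data)
```

After training, `w` and `b` should be close to 2 and 1, and `final_loss` should be far below `initial_loss`. A real deep network repeats the same loop with many layers and lets backpropagation compute the gradients.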
COST FUNCTION
A cost function is a measure of how well a machine learning model performs on
a given dataset. It calculates the difference between the expected value and
the predicted value and represents it as a single real number.
GRADIENT DESCENT
Gradient Descent is an optimization algorithm used to minimize the cost
function (the error) of the model.
LOSS FUNCTION
A loss function compares the target and predicted output values; it measures
how well the neural network models the training data. When training, we aim to
minimize this loss between the predicted and target outputs.
KAIMING AND XAVIER WEIGHT INITIALIZATIONS

• The aim of weight initialization is to prevent layer activation outputs from
exploding or vanishing during the course of a forward pass through a deep
neural network.
• If either occurs, loss gradients will either be too large or too small to flow
backwards beneficially, and the network will take longer to converge, if it is
even able to do so at all.
Different Weight Initialization Techniques

I. Zero Initialization (all weights initialized to 0)
• If we initialize all the weights to 0, the derivative of the loss function
is the same for every weight in W[l], so all weights keep the same value in
subsequent iterations.
• This makes the hidden units symmetric, and the symmetry persists through all
n iterations. A network initialized with zeros is therefore no better than a
linear model.
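A minimal sketch of this symmetry problem, assuming a tiny network with 2 inputs, 2 sigmoid hidden units, 1 linear output and no biases (the architecture, the single training example and the learning rate are assumptions chosen for illustration):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(W1, W2, x, y, lr=0.5):
    """One gradient step on the loss 0.5 * (y_hat - y)^2."""
    # Forward pass through the hidden layer and the linear output unit.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y_hat = sum(w * hj for w, hj in zip(W2, h))
    err = y_hat - y
    # Backward pass: gradients for output weights and hidden weights.
    grad_W2 = [err * hj for hj in h]
    grad_W1 = [[err * W2[j] * h[j] * (1.0 - h[j]) * xi for xi in x]
               for j in range(len(W1))]
    new_W1 = [[w - lr * g for w, g in zip(row, g_row)]
              for row, g_row in zip(W1, grad_W1)]
    new_W2 = [w - lr * g for w, g in zip(W2, grad_W2)]
    return new_W1, new_W2


# Zero initialization: every weight starts at 0.
W1 = [[0.0, 0.0], [0.0, 0.0]]
W2 = [0.0, 0.0]
for _ in range(10):
    W1, W2 = train_step(W1, W2, x=[1.0, 2.0], y=1.0)

# The weights do change, but both hidden units stay exact copies of each other.
hidden_units_identical = (W1[0] == W1[1]) and (W2[0] == W2[1])
```

Both hidden units receive identical gradients at every step, so `hidden_units_identical` stays true however long training runs; the layer behaves like a single unit.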
II. Random Initialization (weights initialized randomly)
– This technique addresses the problems of zero initialization: it prevents
neurons from learning the same features of their inputs. Since our goal is to
make each neuron learn a different function of its input, this technique
gives much better accuracy than zero initialization.
– In general, it is used to break the symmetry. It is better to assign
random non-zero values to the weights.
– Remember, neural networks are very sensitive and prone to overfitting,
as they can quickly memorize the training data.
Best Practices for Weight Initialization
• 👉 Use ReLU or leaky ReLU as the activation function, as both are
relatively robust to the vanishing or exploding gradient problems (especially
for networks that are not too deep). Leaky ReLU never has a zero gradient,
so its neurons never die and training continues.
• 👉 Use Heuristics for weight initialization: For deep neural networks, we
can use any of the following heuristics to initialize the weights depending on
the chosen non-linear activation function.
• While these heuristics do not completely solve the exploding or vanishing
gradients problems, they help to reduce it to a great extent. The most
common heuristics are as follows:
• 1) Xavier/Glorot Initialization:
• Xavier initialization is a heuristic that keeps the variance of the input
to a layer the same as that of the output of the layer. This helps keep the
variance roughly the same throughout the network.
• It works well for the sigmoid and tanh activation functions.
• i) Xavier Normal:
• Wij ~ N(0, std), where std = sqrt(2 / (fan_in + fan_out))
• Here N is a Normal distribution with mean 0, fan_in is the number of inputs
to the layer and fan_out is the number of outputs.
• ii) Xavier Uniform:
• Wij ~ D[-sqrt(6) / sqrt(fan_in + fan_out), sqrt(6) / sqrt(fan_in + fan_out)]
• Where D is a Uniform distribution.
• 2) He Initialization (Kaiming Initialization):
• This weight initialization also has two variations. It works well for the
ReLU and LeakyReLU activation functions.
• i) He Normal:
• Wij ~ N(mean, std), mean = 0, std = sqrt(2 / fan_in)
• Where N is a Normal distribution.
• ii) He Uniform:
• Wij ~ D[-sqrt(6 / fan_in), sqrt(6 / fan_in)]
• Where D is a Uniform distribution.
• Benefits of using these heuristics:
• All these heuristics serve as good starting points for weight initialization and
they reduce the chances of exploding or vanishing gradients.
• With these heuristics, the signal does not vanish or explode too quickly,
because the initial weights are neither too large nor too small.
• They help to avoid slow convergence and ensure that we do not keep
oscillating around the minima.
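The four formulas above translate directly into code. The sketch below uses only the Python standard library; the function names, the row-per-output-neuron layout and the `rng` argument are assumptions made for this example, not a library API.

```python
import math
import random


def xavier_normal(fan_in, fan_out, rng):
    """Wij ~ N(0, std) with std = sqrt(2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def xavier_uniform(fan_in, fan_out, rng):
    """Wij ~ U[-limit, limit] with limit = sqrt(6) / sqrt(fan_in + fan_out)."""
    limit = math.sqrt(6.0) / math.sqrt(fan_in + fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]


def he_normal(fan_in, fan_out, rng):
    """Wij ~ N(0, std) with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def he_uniform(fan_in, fan_out, rng):
    """Wij ~ U[-limit, limit] with limit = sqrt(6 / fan_in)."""
    limit = math.sqrt(6.0 / fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)
# Example: weights for a layer with 100 inputs and 50 ReLU units.
W = he_uniform(fan_in=100, fan_out=50, rng=rng)
```

Each function returns one row of weights per output neuron. Deep learning frameworks ship their own versions of these initializers, so in practice this code is mainly useful for checking the formulas.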
Vanishing and exploding gradient problem

In a network of n hidden layers, n derivatives will be multiplied together.
If the derivatives are large then the gradient will increase exponentially as
we propagate down the model until they eventually explode, and this is what
we call the problem of exploding gradient.
Alternatively, if the derivatives are small then the gradient will decrease
exponentially as we propagate through the model until it eventually
vanishes, and this is the vanishing gradient problem.
In the case of exploding gradients, the accumulation of large derivatives
results in the model being very unstable and incapable of effective learning.
On the other hand, the accumulation of small gradients results in a model
that is incapable of learning meaningful insights, since the weights and biases
of the initial layers, which tend to learn the core features from the input data
(X), will not be updated effectively. In the worst case scenario the gradient
will be 0, which in turn will stop the network from training further.
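A short numerical illustration of the multiplication described above. It makes the simplifying assumption that every layer contributes the same derivative, which real networks do not; the 20-layer depth and the value 1.5 are arbitrary choices for the example.

```python
import math


def sigmoid_derivative(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backprop_scale(per_layer_derivative, n_layers):
    """Product of n_layers identical per-layer derivatives."""
    return per_layer_derivative ** n_layers


# Best case for the sigmoid: every layer contributes its maximum derivative.
vanishing = backprop_scale(sigmoid_derivative(0.0), 20)
# A per-layer derivative only slightly above 1 already blows up.
exploding = backprop_scale(1.5, 20)
```

Even in the sigmoid's best case the gradient reaching the first layer of a 20-layer network is scaled by 0.25^20, which is below 1e-12, while a per-layer factor of 1.5 scales it by more than 3000.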
OPTIMIZATION TECHNIQUES
• Optimization algorithms are responsible for reducing the loss and providing
the most accurate results possible.
• The weights are initialized using some initialization strategy and are
updated with each epoch according to the optimizer's update equation.
• The best results are achieved using optimization strategies or algorithms
called optimizers.
• When we realize that our model is performing poorly, we need to minimize
the loss and maximize the accuracy. That process is known as optimization.
• Optimizers are methods or algorithms used to change the attributes of a
neural network, such as its weights and learning rate, to reduce the loss.
• After calculating the loss, we need to update our weights and biases in the
same iteration.
• Initially we don't know the weights, so we start randomly; by repeatedly
adjusting them based on the loss function, we can drive the loss downwards.
• Some of the techniques are:
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with Momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSProp
9. Adam
10. Nadam
Gradient Descent
• Gradient Descent is one of the most popular techniques to perform
optimization.
• It is easiest to understand on a convex function: it tweaks the parameters
iteratively to move a given function towards its local minimum.
• Gradient Descent is an optimization algorithm for finding a local minimum
of a differentiable function.
• We start by defining initial parameter values, and from there gradient
descent uses calculus to iteratively adjust the values so they minimize the
given cost function.
• The update rule is θ=θ−α.∇J(θ), where ∇J(θ) is the gradient of the cost
function J(θ) w.r.t. the parameters/weights θ, computed for the entire
training dataset, and α is the learning rate.
• "A gradient measures how much the output of a function changes if you
change the inputs a little bit."
• The learning rate determines how big a step gradient descent takes in
the direction of the local minimum. It tells us how fast or slow we
will move towards the optimal weights.
• When we initialize the learning rate, we set an appropriate value which is
neither too low nor too high.
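The effect of the learning rate can be seen on a one-dimensional example. The sketch below assumes the cost function J(θ) = (θ − 3)², whose gradient is 2(θ − 3) and whose minimum is at θ = 3; the cost function, the two learning rates and the step count are assumptions chosen for illustration.

```python
def gradient_descent(grad, theta, learning_rate, steps):
    """Repeatedly apply theta = theta - learning_rate * grad(theta)."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta


def grad_J(theta):
    # Gradient of the assumed cost J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)


# A moderate learning rate converges towards the minimum at theta = 3.
theta_good = gradient_descent(grad_J, theta=0.0, learning_rate=0.1, steps=100)
# A learning rate that is too high overshoots further on every step and diverges.
theta_diverged = gradient_descent(grad_J, theta=0.0, learning_rate=1.1, steps=100)
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; with 1.1 it grows by a factor of 1.2 per step, so `theta_diverged` ends up extremely far from 3.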
• Advantages of Gradient Descent
• Easy Computation
• Easy to implement
• Easy to understand
• Disadvantages of Gradient Descent
• May get trapped at local minima
• Weights are changed only after calculating the gradient on the whole
dataset, so if the dataset is too large it may take a very long time to
converge to the minima
• Requires large memory to calculate the gradient for the whole dataset
• 3 Types of Gradient Descent
• Batch Gradient Descent
• Stochastic Gradient Descent
• Mini Batch Gradient Descent
• Batch Gradient Descent
• In batch gradient descent we use the entire dataset to calculate the
gradient of the cost function for each epoch.
• Because there is only one update per epoch, convergence is slow in batch
gradient descent.
• SGD - Stochastic Gradient Descent
• The SGD algorithm is an extension of Gradient Descent that overcomes some
of the disadvantages of the gradient descent algorithm.
• The SGD derivative is computed taking one observation at a time. So if the
dataset contains 100 observations, it updates the model weights and biases
100 times in 1 epoch.
• SGD performs a parameter update for each training example x(i)
and label y(i): θ=θ−α.∇J(θ; x(i); y(i))
• Advantages of SGD
• Memory requirement is less compared to Gradient Descent algorithm.
• Disadvantages of SGD
• May get stuck at local minima
• Time taken by 1 epoch is large compared to Gradient Descent
• MBGD - Mini Batch Gradient Descent

• MBGD is a combination of both batch and stochastic gradient descent. It
divides the training data into small batches and performs an update on each
batch.
• So here only a subset of the dataset is used for calculating the loss for
each update.
• Mini-batch gradient descent is widely used and converges faster because
each epoch needs far fewer updates than SGD, while each update is much
cheaper than a full-batch step.
• Advantages of Mini Batch Gradient Descent
• Less time taken to converge the model
• Requires medium amount of memory
• Frequently updates the model parameters and also has less variance.
• Disadvantages of Mini Batch Gradient Descent
• If the learning rate is too small then convergence rate will be slow.
• It doesn't guarantee good convergence
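The three variants differ only in how many observations feed each update. A small sketch, assuming a dataset of 100 observations (as in the SGD example above) and a mini-batch size of 32, which is an arbitrary choice; `iterate_minibatches` is a helper name made up for this example.

```python
def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most batch_size items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(100))  # stand-in for 100 training observations

# Batch gradient descent: the whole dataset forms one batch.
updates_batch_gd = len(list(iterate_minibatches(samples, len(samples))))
# Stochastic gradient descent: one observation per update.
updates_sgd = len(list(iterate_minibatches(samples, 1)))
# Mini-batch gradient descent: 32 observations per update.
updates_mini_batch = len(list(iterate_minibatches(samples, 32)))
last_batch_size = len(list(iterate_minibatches(samples, 32))[-1])
```

One epoch therefore costs 1 update for batch gradient descent, 100 updates for SGD and 4 updates for mini-batches of 32 (the last batch holds the remaining 4 observations).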
Problems with Gradient Descent
• There are a few problems that can occur when using gradient descent:
• I. Local Minima:
• Gradient descent can get stuck in local minima, points that are not the global
minimum of the cost function but are still lower than the surrounding points.
This can occur when the cost function has multiple valleys, and the
algorithm gets stuck in one instead of reaching the global minimum.
• II. Saddle Points:
• A saddle point is a point where the gradient is zero but which is a minimum
along one direction and a maximum along another.
• Gradient descent can get stuck near these points because the gradient
there is close to zero, so the updates become very small even though the
cost could still decrease along one direction.
• III. Plateaus:
• A plateau is a region in the cost function where the gradients are very small
or close to zero. This can cause gradient descent to take a long time or not
converge.
• IV. Oscillations:
• Oscillations occur when the learning rate is too high, causing the algorithm
to overshoot the minimum and oscillate back and forth.
• V. Slow convergence:
• Gradient descent can converge very slowly when the cost function is
complex or has many local minima. This means the algorithm may take a
long time to find the global minimum.
• VI. Stochasticity:
• In stochastic gradient descent, the cost function is evaluated at random
samples from the data set. This introduces randomness into the algorithm,
making converging to a global minimum more difficult.
• VII. Vanishing or Exploding Gradients:
• Deep neural networks with many layers can suffer from vanishing or
exploding gradients. This occurs when the gradients become very small or
large, respectively, as they are backpropagated through the layers. This can
make it difficult for the algorithm to update the weights and biases.
MOMENTUM
• Momentum helps to:
• Escape local minima and saddle points
• Converge faster by reducing oscillations
• Smooth out weight updates for stability
• Momentum only changes how the weights are updated; it does not reduce
model complexity, so it is not by itself a remedy for overfitting.
• It can be used in combination with other optimization algorithms for
improved performance.
• Momentum was introduced to reduce the high variance in SGD.
• Instead of depending only on the current gradient to update the weights,
gradient descent with momentum replaces the current gradient with V (which
stands for velocity), the exponential moving average of current and past
gradients.
• Momentum simulates the inertia of an object when it is moving, the
direction of the previous update is retained to a certain extent during the
update, while the current update gradient is used to fine-tune the final update
direction.
• One more hyperparameter is used in this method, known as momentum and
symbolized by ‘γ’.
V(t)=γV(t−1)+α.∇J(θ)
Now, the weights are updated by θ=θ−V(t).

• The momentum term γ is usually set to 0.9 or a similar value.
• The velocity at time ‘t’ is computed using all previous updates, giving more
weight to recent updates than to older ones. This speeds up convergence.
• Essentially, when using momentum, we push a ball down a hill. The ball
accumulates momentum as it rolls downhill, becoming faster and faster on
the way.
• The same thing happens to our parameter updates: the momentum term
increases updates for dimensions whose gradients point in the same direction
and reduces updates for dimensions whose gradients change direction.
As a result, we gain faster convergence and reduced oscillation.
• Advantages
• Converges faster than SGD
• All advantages of SGD
• Reduces the oscillations and high variance of the parameters
• Disadvantage
• One more extra variable is introduced that we need to compute for each
update
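The update rule V(t)=γV(t−1)+α.∇J(θ), θ=θ−V(t) can be tried on a small example. The sketch below assumes the elongated bowl J(θ) = 0.5 (θ₀² + 25 θ₁²), a starting point of (10, 1) and α = 0.02; these values are assumptions picked so that plain gradient descent crawls along the flat direction.

```python
import math


def grad_J(theta):
    # Gradient of the assumed cost J = 0.5 * (theta[0]^2 + 25 * theta[1]^2).
    return [theta[0], 25.0 * theta[1]]


def sgd_momentum(theta, alpha, gamma, steps):
    """V(t) = gamma * V(t-1) + alpha * grad J(theta); theta = theta - V(t)."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad_J(theta)
        v = [gamma * vi + alpha * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta


def distance_to_minimum(theta):
    # The minimum of the assumed cost is at the origin.
    return math.sqrt(sum(t * t for t in theta))


# gamma = 0 switches momentum off, which gives plain gradient descent.
plain = sgd_momentum([10.0, 1.0], alpha=0.02, gamma=0.0, steps=100)
with_momentum = sgd_momentum([10.0, 1.0], alpha=0.02, gamma=0.9, steps=100)

plain_distance = distance_to_minimum(plain)
momentum_distance = distance_to_minimum(with_momentum)
```

With the same learning rate and the same number of steps, `momentum_distance` should come out much smaller than `plain_distance`, because the velocity keeps building up along the direction in which the gradient does not change sign.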
NAG - Nesterov Accelerated Gradient

• Momentum may be a good method, but if the momentum is too high the
algorithm may overshoot the local minimum and continue to climb up the
other side.
• In NAG, the parameter update is first made with the history element (the
previous velocity), and only then is the derivative calculated, which can
move the parameters in the forward or backward direction.
• This is called the look-ahead approach, and it makes more sense because if
the curve is near the minimum, the derivative can slow the movement down
so that there are fewer oscillations, which saves time.
• We know we’ll be using γ.V(t−1) for modifying the weights, so θ−γV(t−1)
approximately tells us the future location.
• Now, we’ll calculate the gradient of the cost based on this future parameter
rather than the current one:
V(t)=γV(t−1)+α.∇J(θ−γV(t−1))
and then update the parameters using θ=θ−V(t).
Both NAG and SGD with momentum algorithms work equally well and share the
same advantages and disadvantages.
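A minimal sketch of the look-ahead step, assuming the same kind of toy cost as a momentum example would use, J(θ) = 0.5 (θ₀² + 25 θ₁²), with α = 0.02 and γ = 0.9. The cost function, starting point and hyperparameter values are assumptions for illustration.

```python
import math


def grad_J(theta):
    # Gradient of the assumed cost J = 0.5 * (theta[0]^2 + 25 * theta[1]^2).
    return [theta[0], 25.0 * theta[1]]


def nesterov(theta, alpha, gamma, steps):
    """NAG: evaluate the gradient at the look-ahead point theta - gamma * V(t-1)."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        lookahead = [ti - gamma * vi for ti, vi in zip(theta, v)]
        g = grad_J(lookahead)
        v = [gamma * vi + alpha * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta


theta_nag = nesterov([10.0, 1.0], alpha=0.02, gamma=0.9, steps=100)
nag_distance = math.sqrt(sum(t * t for t in theta_nag))
```

The only difference from classical momentum is the line that builds `lookahead`: the gradient is taken where the velocity is about to carry the parameters, not where they currently are.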
AdaGrad - Adaptive Gradient Descent

• AdaGrad is a little different from the other gradient descent algorithms.
In all the previously discussed algorithms the learning rate was constant.
Here the key idea is to have an adaptive learning rate for each of the weights.
• It uses a different learning rate at each iteration. The more a parameter
has been updated, the smaller its learning rate becomes.
• Many datasets mix sparse and dense features, yet the earlier algorithms keep
the learning rate constant for all parameters and all iterations.
α(t)=η/sqrt(G(t)+ε)
Here α(t) denotes the learning rate at iteration t, η is a constant (the
initial learning rate), G(t) is the sum of the squared gradients up to
iteration t, and ε is a small positive value to avoid division by 0.
Advantages
● Learning rate changes adaptively, so no manual tuning is required
● One of the best algorithms to train on sparse data
Disadvantages
● Learning rate is always decreasing, which leads to slow convergence
● Because the learning rate eventually becomes very small, the model can no
longer learn effectively, and hence the accuracy of the model is
compromised.
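The always-decreasing learning rate can be observed directly. The sketch below assumes the usual AdaGrad form, in which the effective rate is a constant divided by the square root of the accumulated squared gradients, and applies it to the toy cost J(θ) = θ². The cost function, the starting point and the constant `eta = 1.0` are assumptions for illustration.

```python
import math


def adagrad_minimize(grad, theta, eta=1.0, eps=1e-8, steps=20):
    """AdaGrad for a single parameter; also records the effective learning rates."""
    grad_sq_sum = 0.0  # running sum of all squared gradients seen so far
    rates = []
    for _ in range(steps):
        g = grad(theta)
        grad_sq_sum += g * g
        rate = eta / math.sqrt(grad_sq_sum + eps)
        rates.append(rate)
        theta = theta - rate * g
    return theta, rates


# Gradient of the assumed cost J(theta) = theta^2.
theta_final, rates = adagrad_minimize(lambda th: 2.0 * th, theta=5.0)
```

Because `grad_sq_sum` can only grow, every entry of `rates` is at most as large as the one before it. This is the behaviour listed under the disadvantages: the parameter still moves towards the minimum, but with ever smaller steps.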
AdaDelta
• AdaDelta is an extension of AdaGrad. In AdaGrad the learning rate keeps
shrinking, so that after some time it approaches zero.
• AdaDelta was introduced to get rid of this learning rate decay problem.
• To deal with this problem, AdaDelta uses two state variables: one stores a
leaky average of the second moment of the gradient, and the other a leaky
average of the second moment of the change of parameters in the model.
• In simple terms, AdaDelta adapts the learning rate based on a moving window
of gradient updates, instead of accumulating all past gradients.
• Here an exponentially moving average is used rather than the sum of all the
squared gradients.
• E[g²](t)=γ.E[g²](t−1)+(1−γ).g²(t)
• We set γ to a similar value as the momentum term, around 0.9.
• Advantages
• Learning rate doesn't decay
• Disadvantages
• Computationally expensive
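A sketch of the two state variables in code, for a single parameter. It follows the commonly published form of AdaDelta, in which the step is the ratio of the two root-mean-square averages times the gradient and there is no separate learning rate. The toy cost J(θ) = θ², the starting point and `eps = 1e-6` are assumptions for illustration.

```python
import math


def adadelta_minimize(grad, theta, gamma=0.9, eps=1e-6, steps=100):
    """AdaDelta with leaky averages of squared gradients and squared updates."""
    avg_sq_grad = 0.0   # leaky average of g^2
    avg_sq_delta = 0.0  # leaky average of (parameter change)^2
    for _ in range(steps):
        g = grad(theta)
        avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * g * g
        # Step size = RMS of past updates divided by RMS of gradients.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = gamma * avg_sq_delta + (1 - gamma) * delta * delta
        theta = theta + delta
    return theta


# Gradient of the assumed cost J(theta) = theta^2.
theta_final = adadelta_minimize(lambda th: 2.0 * th, theta=5.0)
```

The first steps are tiny because `avg_sq_delta` starts at zero and only `eps` keeps the numerator alive; the step size then grows as the average of past updates builds up, instead of shrinking as it does in AdaGrad.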
RMSProp - Root Mean Square Propagation
• RMSProp is a variant of AdaGrad, designed as an improvement on the AdaGrad
optimizer.
• Here the learning rate is scaled by an exponential moving average of the
squared gradients instead of the cumulative sum of squared gradients.
• RMSProp basically applies the moving-average idea used in momentum to
AdaGrad's accumulator.
● Like AdaDelta, RMSProp divides the learning rate by an exponentially
decaying average of squared gradients:
E[g²](t)=γ.E[g²](t−1)+(1−γ).g²(t)
● The decay rate γ (which plays the same role as the momentum term) is
usually 0.9, so (1−γ)=1−0.9=0.1.
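The difference between the cumulative sum and the decaying average shows up when large gradients are followed by small ones. The sketch below compares the effective learning rates of the two schemes on a made-up gradient sequence; the sequence, `eta = 0.1` and the helper names are assumptions for illustration.

```python
import math


def adagrad_rates(grads, eta=0.1, eps=1e-8):
    """Effective learning rate when squared gradients are summed forever."""
    total = 0.0
    rates = []
    for g in grads:
        total += g * g
        rates.append(eta / math.sqrt(total + eps))
    return rates


def rmsprop_rates(grads, eta=0.1, gamma=0.9, eps=1e-8):
    """Effective learning rate with an exponentially decaying average."""
    avg = 0.0
    rates = []
    for g in grads:
        avg = gamma * avg + (1 - gamma) * g * g
        rates.append(eta / math.sqrt(avg + eps))
    return rates


# Five large gradients followed by fifty small ones.
grads = [10.0] * 5 + [0.1] * 50
ada = adagrad_rates(grads)
rms = rmsprop_rates(grads)
```

After the large gradients stop, the RMSProp rate recovers (`rms[-1]` is larger than `rms[4]`) because old squared gradients are forgotten, while the AdaGrad rate stays pinned at a small value.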
ADAM
• Adam can be looked at as a combination of RMSprop and Stochastic
Gradient Descent with momentum.
• It uses the squared gradients to scale the learning rate like RMSprop, and
it takes advantage of momentum by using a moving average of the gradient.
• It also introduces two new hyper-parameters, beta1 and beta2, which
are usually kept around 0.9 and 0.999, but we can change them according to
our use case. The default value for the learning rate η is 0.001.
• Adam is an adaptive learning rate method, which means, it computes
individual learning rates for different parameters.
• Its name is derived from adaptive moment estimation, and the reason it’s
called that is because Adam uses estimations of first and second moments of
gradient to adapt the learning rate for each weight of the neural network.
• The n-th moment of a random variable is defined as the expected value of
that variable to the power of n:
m_n = E[X^n]
where m is the moment and X is the random variable.

• The first moment is mean, and the second moment is uncentered variance
(meaning we don’t subtract the mean during variance calculation).
• To estimate the moments, Adam utilizes exponentially moving averages,
computed on the gradient evaluated on the current mini-batch:
m_t = β1 * m_{t-1} + (1 - β1) * g_t
v_t = β2 * v_{t-1} + (1 - β2) * g_t^2
• where g_t is the gradient at time t, m_t and v_t are the moving-average
estimates of the first and second moments of the gradients, respectively, and
beta1 and beta2 are hyperparameters that control the decay rates of the
moment estimates, with good default values of 0.9 and 0.999 respectively.
• Because m_t and v_t start at zero, they are biased towards zero in the
early steps, so Adam uses bias-corrected estimates:
mhat_t = m_t / (1 - β1^t)
vhat_t = v_t / (1 - β2^t)
• The only thing left to do is to use those moving averages to scale the
learning rate individually for each parameter. To perform the weight update
we do the following:
theta_t = theta_{t-1} - α * mhat_t / (sqrt(vhat_t) + ε)
• where alpha is the learning rate and epsilon is a small constant used to
prevent division by zero.
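The Adam update can be sketched for a single parameter as follows. It uses the bias-corrected estimates and the update rule given above, with the moving averages written in their usual form. The toy cost J(θ) = (θ − 3)² and the learning rate of 0.1 (larger than the usual default, to keep the run short) are assumptions for illustration.

```python
import math


def adam_minimize(grad, theta, alpha=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Adam for a single parameter."""
    m = 0.0  # moving average of the gradient (first moment)
    v = 0.0  # moving average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Gradient of the assumed cost J(theta) = (theta - 3)^2, minimum at theta = 3.
theta_final = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

In the first step the bias-corrected ratio `m_hat / sqrt(v_hat)` has magnitude close to 1, so the parameter moves by roughly `alpha` regardless of the gradient's scale. This is one reason Adam is fairly insensitive to how the cost is scaled.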
• What Are the Advantages of Adam Optimization?
• Adam optimization offers several advantages over other optimization
algorithms:
1. Adaptive Learning Rates: Unlike fixed learning rate methods like SGD,
Adam optimization provides adaptive learning rates for each parameter
based on the history of gradients. This allows the optimizer to converge
faster and more accurately, especially in high-dimensional parameter spaces.
2. Momentum: Adam optimization uses momentum to smooth out fluctuations
in the optimization process, which can help the optimizer avoid local
minima and saddle points.
3. Bias Correction: Adam optimization applies bias correction to the first and
second moment estimates to ensure that they are unbiased estimates of the
true values.
4. Robustness: Adam optimization is relatively robust to hyperparameter
choices and works well across a wide range of deep learning architectures.
Best Practices for Using Adam Optimization
• Use Default Hyperparameters: In most cases, the default
hyperparameters for Adam optimization (beta1=0.9, beta2=0.999,
epsilon=1e-8) work well and do not need to be tuned.
• Monitor Learning Rate: It can be helpful to monitor the learning rate
during training to ensure that it is not too high or too low. A good rule of
thumb is to start from a small initial learning rate, such as the default
0.001, and adjust it based on how the loss behaves during training.
• Regularization: Adam optimization can benefit from regularization
techniques like weight decay or dropout to prevent overfitting.
• Batch Size: The batch size can have an impact on the performance of
Adam optimization. In general, larger batch sizes tend to work better with
Adam optimization compared to other optimization algorithms.
CONCEPT OF REGULARIZATION
• Regularization is a set of techniques that can prevent overfitting in neural
networks and thus improve the accuracy of a Deep Learning model when
facing completely new data from the problem domain.
• Overfitting refers to the phenomenon where a neural network models the
training data very well but fails when it sees new data from the same
problem domain.
• Overfitting is caused by noise in the training data that the neural network
picks up during training and learns it as an underlying concept of the data.
• Weight regularization is a technique which aims to stabilize an overfitted
network by penalizing large weight values in the network.
• An overfitted network usually has large weight values, so a small change in
the input can lead to large changes in the output.
• For instance, when the network is given new or test data, it produces
incorrect predictions.
• Weight regularization penalizes the network’s large weights, forcing the
optimization algorithm to reduce them to smaller values; this makes the
network more stable and improves its performance on new data.
• In weight regularization, the network configuration remains unchanged; only
the values of the weights are modified.
• Weight regularization reduces overfitting by adding a penalty or constraint
to the loss function.
• In Deep Learning there are two well-known regularization techniques:
• L1 and L2 regularization.
• Both add a penalty to the cost based on the model complexity, so instead of
calculating the cost by simply using a loss function, there will be an
additional element (called “regularization term”) that will be added in order
to penalize complex models.
L1 regularization (LASSO regression: Least Absolute Shrinkage and Selection
Operator) produces sparse weight matrices. A sparse matrix is one in which most
elements are zero; in this context a sparse matrix can have many close-to-zero
values and a few larger values. If we find a model with neurons whose weights
are close to zero, it means we don’t need those neurons because the model
deactivates them with zeros, and we might not need a specific feature/input,
leading to a simpler model. For instance, if we have 50 coefficients but only
10 are non-zero, the other 40 are irrelevant to our predictions. This is not
only interesting from the efficiency point of view but also from the economic
point of view: gathering data and extracting its features might be a very
expensive task (in terms of time and money). Reducing this will benefit us.
Due to the absolute value, L1 regularization introduces a term that is not
differentiable at zero, but despite that, there are methods to minimize it.
L2 regularization (Ridge regression), on the other hand, leads to a balanced
minimization of the weights. Since L2 uses squares, it penalizes large weights
much more heavily than small ones, shrinking all weights towards zero without
making them exactly zero. Unlike L1, the L2 penalty is differentiable
everywhere, and for linear regression it has an analytical solution, which
makes it computationally efficient.
Both regularizations have a λ parameter which is directly proportional to the
penalty: the larger λ, the stronger the penalty on complex models and the more
likely it is that training will avoid them. Likewise, if λ is zero,
regularization is deactivated.
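The two penalties and the role of λ can be written out in a few lines. This is a sketch under assumptions: the weights, the data loss of 0.4 and λ = 0.1 are made-up numbers, and the penalties are written in their plainest form (some texts scale the L2 term by 1/2).

```python
def l1_penalty(weights, lam):
    """lambda times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lambda times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def regularized_cost(data_loss, weights, lam, penalty):
    """Cost = loss on the data + regularization term."""
    return data_loss + penalty(weights, lam)


weights = [0.5, -2.0, 0.0, 3.0]
cost_l1 = regularized_cost(0.4, weights, lam=0.1, penalty=l1_penalty)
cost_l2 = regularized_cost(0.4, weights, lam=0.1, penalty=l2_penalty)
cost_off = regularized_cost(0.4, weights, lam=0.0, penalty=l2_penalty)
```

Here the L1 term is 0.1 × 5.5 and the L2 term is 0.1 × 13.25; the weight 3.0 alone contributes 9 of the 13.25, which shows how squaring concentrates the penalty on the largest weights. With λ = 0 the cost is just the data loss.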
EARLY STOPPING
• Early stopping is a regularization technique used to reduce overfitting
without compromising on model accuracy.
• The main idea behind early stopping is to stop training before a model starts
to overfit.
Early stopping approaches

• 1. Training model on a preset number of epochs
• This method is simple.
• By running a set number of epochs, we run the risk of not reaching a
satisfactory training point.
• With a higher learning rate, the model might converge in fewer epochs, but
this method requires a lot of trial and error.
• Due to the advancements in machine learning, this method is largely
obsolete.
• 2. Stop when the loss function update becomes small
• This approach is more sophisticated than the first as it is built on the fact that
the weight updates in gradient descent become significantly smaller as the
model approaches minima.
• Usually, training is stopped when the update becomes smaller than a chosen
threshold (for example 0.001), as stopping at this point keeps the loss low
and saves computing power by preventing unnecessary epochs.
• However, overfitting might still occur.
• 3. Validation set strategy
• This clever technique is the most popular early stopping approach. To
understand how it works, it’s important to look at how training and
validation errors change with the number of epochs (as in the figure above).
• The training error decreases exponentially until increasing epochs no longer
have such a large effect on the error.
• The validation error, however, initially decreases with increasing epochs, but
after a certain point, it starts increasing.
• This is the point where a model should be early stopped as beyond this the
model will start to overfit.
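The validation set strategy is often implemented with a "patience" counter: training stops once the validation loss has failed to improve for a given number of epochs in a row. Patience is an assumption added here, not something these notes define, and the validation losses below are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best model: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Training ran to the end without triggering the stop.
    return best_epoch, len(val_losses) - 1


# Validation loss falls, then rises again as the model starts to overfit.
val_losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.9]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
```

With these numbers the best validation loss occurs at epoch 3 (counting from 0) and training stops at epoch 5; in practice the weights saved at `best_epoch` are the ones that are kept.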
• Benefits of Early Stopping:

• Helps in reducing overfitting
• It improves generalisation
• It requires less amount of training data
• Takes less time compared to other regularisation models
• It is simple to implement
Limitations of Early Stopping:
• If the model stops too early, there might be risk of underfitting
• It may not be beneficial for all types of models
• If validation set is not chosen properly, it may not lead to the most optimal
stopping
• To summarize, early stopping is best used to prevent overfitting of the
model and to save resources. It gives the best results when a few things are
taken care of: tuning its parameters, stopping before the model overfits, and
ensuring that the model still learns enough from the data.
DATASET AUGMENTATION
• It is a set of techniques to artificially increase the size of a dataset by
modifying copies of existing data or by synthetically generating new data
from the existing dataset.
• In other words, data augmentation is a process of artificially increasing
the amount of data by generating new data points from existing data.
• This includes adding minor alterations to data or using machine learning
models to generate new data points to amplify the dataset.
• Synthetic data: Data that is generated artificially without using real-world
images. Synthetic data are often produced by Generative Adversarial
Networks.
• Augmented data: Data derived from original images with some sort of minor
geometric transformation (such as flipping, translation, rotation, or the
addition of noise) in order to increase the diversity of the training set.
• For images, data augmentation is the process of transforming images to
create new ones for training machine learning models.
• This is an important step when building datasets because modern machine
learning models are very powerful; if they're given datasets that are too
small, these models can start to ‘overfit’.
Common Data Augmentation Techniques

• I. Spatial Transformation
• With spatial transformation techniques, pixels are moved around the image
in set ways to create the augmented image.
• Flipping
• This is a very simple technique in which an image is flipped horizontally to
produce a mirror image or flipped vertically to produce an image that is
upside down.
• Rotation
• With this technique, we rotate the entire image by a certain degree.
• Translation
• The entire image is shifted left/right and/or up/down by a certain amount.
This will result in objects of interest appearing in different locations of the
image frame after translation is applied.
• Cropping
• Given an image, we select part of the image (normally a square or
rectangular section), take a crop of this selection, and then resize the crop to
the original size of the image.
• II. Colour Transformation
• With colour transformation techniques, the spatial aspect of the image is
normally preserved while the values of the pixels are edited.
• Brightness
• The pixel values of the image are either increased to result in a lighter,
brighter image or reduced to result in a darker, dimmer image.
• Contrast
• Contrast is the difference between the bright and dark parts of an image.
Increasing the contrast generally involves making the bright parts of the
image brighter and the dark parts darker.
• III. Advanced Data Augmentation Techniques
• GridMask
• Unlike the above techniques of spatial and colour transformations,
GridMask falls under a third set of transformation techniques which we can
refer to as 'information deletion’.
• With these techniques, parts of the image are removed by setting the pixels
to 0 or placing some patch over that part of the image, thereby deleting
information.
• IV. Niche Data Augmentation Techniques
• Temporal Reordering
• Most of the techniques discussed above work well on single images.
However, when we work with sequences of images from a camera, we can also
incorporate augmentation techniques specific to video data. With temporal
reordering, given a pair of images, we reverse the order of the images and
present the reversed pair to the model as a different training example.
BENEFITS OF DATA AUGMENTATION

• Improving model prediction accuracy by:
• adding more training data to the models
• preventing data scarcity for better models
• reducing overfitting (i.e. a model that corresponds too closely to a
limited set of data points) and creating variability in the data
• increasing the generalization ability of the models
• Reducing the costs of collecting and labeling data
• Enabling rare event prediction
PARAMETER TYING AND SHARING

• Parameter tying and parameter sharing are another well-known approach for
controlling the complexity of deep neural networks by forcing certain
weights to share the same value.
• L2 regularization (or weight decay) penalizes model parameters for
deviating from the fixed value of zero.
• We may know from the domain and the model architecture that there should
be some dependencies between model parameters.
• Model Parameters: These are the parameters in the model that must be
determined using the training data set. These are the fitted parameters.
• Hyperparameters: These are adjustable parameters that must be tuned in
order to obtain a model with optimal performance.
• We want to express that certain parameters should be close to one another.
• This approach was used for regularizing the parameters of one model,
trained as a supervised classifier, to be close to the parameters of another
model, trained in an unsupervised paradigm (to capture the distribution of
the input data).
• Ex. of unsupervised learning: k-means clustering
• Input x is mapped to a one-hot vector h: if x belongs to cluster i, then
h_i = 1 and all other entries are zero.
• Parameter sharing is where we force sets of parameters to be equal.
• It is called sharing because we interpret the various models or model
components as sharing a unique set of parameters.
• Only a subset of the parameters (the unique set) needs to be stored in memory.
• In a CNN, this leads to a significant reduction in the memory footprint of
the model.
Use of parameter sharing in CNNs

• The most extensive use of parameter sharing is in convolutional neural
networks (CNNs).
• Natural images have many statistical properties that are invariant to
translation.
• Ex: a photo of a cat remains a photo of a cat if it is translated one pixel
to the right.
• CNNs take this property into account by sharing parameters across multiple
image locations.
• Thus we can find a cat with the same cat detector whether the cat appears
at column i or column i+1 in the image.
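The effect of sharing one set of weights across locations can be sketched in one dimension. Below, a single hypothetical 3-weight "detector" is slid over a signal; when the pattern moves one position to the right, the peak response moves with it, and no new weights are needed. The kernel values and the signal are made up for illustration.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Apply the same kernel at every position (cross-correlation, no padding)."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

kernel = np.array([1.0, -1.0, 1.0])    # hypothetical detector: 3 shared weights
signal = np.zeros(10)
signal[2:5] = [1.0, -1.0, 1.0]         # the pattern starts at position 2
shifted = np.roll(signal, 1)           # the same pattern, one step to the right

response = conv1d_valid(signal, kernel)
response_shifted = conv1d_valid(shifted, kernel)

peak = int(np.argmax(response))                 # where the detector fires
peak_shifted = int(np.argmax(response_shifted)) # fires one position later

# Memory comparison: the shared kernel stores 3 weights, whereas a fully
# connected map from 10 inputs to 8 outputs would store 10 * 8 = 80 weights.
shared_weights = kernel.size
dense_weights = signal.size * response.size
```

The same detector finds the pattern at either location, which is the one-dimensional analogue of finding the cat at column i or column i+1.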
ENSEMBLE METHODS
• Ensemble learning is a machine learning paradigm where multiple models
(often called “weak learners”) are trained to solve the same problem and
combined to get better results. The main hypothesis is that when weak
models are correctly combined we can obtain more accurate and/or robust
models.
• Weak Learners: A ‘weak learner’ is any ML algorithm (for
regression/classification) that provides an accuracy slightly better than
random guessing.
• In ensemble learning theory, we call weak learners (or base models)
models that can be used as building blocks for designing more complex
models by combining several of them.
• Most of the time, these basic models do not perform well by themselves,
either because they have high bias or because they have too much variance
to be robust.
• The idea of ensemble methods is then to reduce the bias and/or variance
of such weak learners by combining several of them into
a strong learner (or ensemble model) that achieves better performance.
• Types of ensemble methods:
• BAGGING
• BOOSTING
• STACKING
• Bagging aims to decrease variance, boosting aims to decrease bias, and
stacking aims to improve prediction accuracy.
• BAGGING
• Bagging stands for Bootstrap Aggregation.
• Bootstrapping is a technique of drawing different sets of data from a given
training set by sampling with replacement. After bootstrapping the training
dataset, we train a model on each of the sets and aggregate the results. This
technique is known as Bootstrap Aggregation or Bagging.
• Bagging is the type of ensemble technique in which a single training
algorithm is used on different subsets of the training data, where the subset
sampling is done with replacement (bootstrap). Once the algorithm has been
trained on all the subsets, bagging predicts by aggregating the predictions
made by the models trained on the different subsets.
• For aggregating the outputs of base learners, bagging uses majority voting
(the most frequent prediction among all predictions) for classification and
averaging (the mean of all the predictions) for regression.
• Advantages of a Bagging Model:
• 1. Bagging can significantly decrease the variance without increasing bias.
• 2. Bagging methods work well because of the diversity in the training data,
since the sampling is done by bootstrapping.
• 3. Also, if the training set is very large, computational time can be saved
by training each model on a relatively smaller data set while still
increasing the accuracy of the model.
• 4. Works well with small datasets as well.
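The bagging procedure described above (bootstrap sampling, one model per sample, then aggregation) can be sketched as follows. The toy data, the choice of a degree-5 polynomial as base learner, and all function names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): noisy samples of a sine curve.
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

def fit_base_learner(x_train, y_train, degree=5):
    """Base learner: a polynomial fitted by least squares."""
    return np.polyfit(x_train, y_train, degree)

def bagging_fit(x, y, n_models=25):
    """Train one base learner per bootstrap sample."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, x.size, size=x.size)   # sampling with replacement
        models.append(fit_base_learner(x[idx], y[idx]))
    return models

def bagging_predict(models, x_new):
    """Regression: average the predictions of all base learners."""
    preds = np.stack([np.polyval(coeffs, x_new) for coeffs in models])
    return preds.mean(axis=0)

def majority_vote(votes):
    """Classification: most frequent label per column of an (n_models, n_samples) array."""
    return np.array([np.bincount(column).argmax() for column in votes.T])

models = bagging_fit(x, y)
bagged = bagging_predict(models, x)
voted = majority_vote(np.array([[0, 1], [1, 1], [0, 0]]))
```

`majority_vote` is shown separately with made-up integer labels, since the regression example only needs averaging.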
• BOOSTING
• The term ‘Boosting’ refers to a family of algorithms that convert weak
learners into strong learners. Boosting is an ensemble method for improving
the predictions of any given learning algorithm. The idea of boosting is to
train weak learners sequentially, each trying to correct the errors of its
predecessor. Combined in this way, the sequence of weak learners forms a
strong learner.
PROS
• Computational scalability
• Handles missing values
• Robust to outliers
• Does not require feature scaling
• Can deal with irrelevant inputs
• Interpretable (if small)
CONS
• Inability to extract a linear combination of features
• High variance, leading to low predictive power
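The sequential idea behind boosting, where each weak learner tries to correct what its predecessors got wrong, can be sketched with a residual-fitting variant for regression (in the spirit of gradient boosting with squared error). This is one member of the boosting family chosen for brevity; other members, such as AdaBoost, reweight the training samples instead. The decision-stump weak learner, the learning rate and the toy data are assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Weak learner: the single threshold split with the lowest squared error."""
    best = None
    for t in np.unique(x)[:-1]:              # keep both sides of the split non-empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]                          # (threshold, left value, right value)

def stump_predict(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.5):
    """Fit stumps one after another, each on the errors left by the previous ones."""
    prediction = np.zeros_like(y)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction            # what the ensemble still gets wrong
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * stump_predict(stump, x)
        stumps.append(stump)
    return stumps, prediction

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)

_, one_round = boost(x, y, n_rounds=1)
stumps, boosted = boost(x, y, n_rounds=50)
mse_one = float(np.mean((y - one_round) ** 2))
mse_boosted = float(np.mean((y - boosted) ** 2))
```

On this toy data the training error after 50 rounds is lower than after a single stump, which illustrates how many weak learners combine into a stronger one.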
STACKING
• Stacking is an ensemble learning method that combines multiple machine
learning algorithms via meta-learning: the base-level algorithms are
trained on the complete training data-set, and a meta-model is then trained
using the outputs of all the base-level models as its features.
Advantages of a Stacked Generalization Model:
• Stacking can improve the model prediction accuracy.
Disadvantage of a Stacked Generalization Model:

• 1. Because every individual classifier is trained on the whole dataset, the
computational time grows for huge datasets, since each classifier works
independently on the full data.
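A minimal sketch of stacking, following the description above: two base models are trained on the complete training set, and a meta-model is trained on their predictions. The data, the two least-squares base models and all names are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.5 * x + 0.8 * x**2 + rng.normal(scale=0.1, size=x.size)

def fit_least_squares(features, target):
    """Least-squares weights so that features @ weights approximates target."""
    weights, *_ = np.linalg.lstsq(features, target, rcond=None)
    return weights

def mse(pred):
    return float(np.mean((y - pred) ** 2))

# Base-level models: each captures only part of the structure in the data.
feats_a = np.column_stack([np.ones_like(x), x])       # linear in x
feats_b = np.column_stack([np.ones_like(x), x**2])    # uses only x squared
pred_a = feats_a @ fit_least_squares(feats_a, y)
pred_b = feats_b @ fit_least_squares(feats_b, y)

# Meta-model: trained on the base models' outputs, used as features.
meta_features = np.column_stack([pred_a, pred_b])
w_meta = fit_least_squares(meta_features, y)
stacked = meta_features @ w_meta
```

Here the stacked training error cannot exceed that of either base model, because the meta-model could always reproduce a single base prediction. In practice the meta-model is often trained on held-out predictions of the base models to limit overfitting.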
DROPOUT
• The term “dropout” refers to dropping out nodes (in the input and hidden
layers) of a neural network.
• All the forward and backward connections of a dropped node are
temporarily removed, thus creating a new network architecture out of the
parent network.
• Each node is dropped independently with a dropout probability of p.
• Let’s try to understand this with a given input x: {1, 2, 3, 4, 5} to the
fully connected layer. We have a dropout layer with probability p = 0.2 (or
keep probability = 0.8). During forward propagation (training), each element
of x is dropped with probability 0.2, so on average 20% of the nodes are
dropped; x could become {1, 0, 3, 4, 5} or {1, 2, 0, 4, 5}, and so on. The
same is applied to the hidden layers.
• For instance, if a hidden layer has 1000 neurons (nodes) and dropout is
applied with drop probability = 0.5, then about 500 neurons are randomly
dropped in every iteration (batch).
• With dropout, every iteration trains a smaller, randomly thinned version of
the full network, and this acts as a form of regularization.
• Dropout helps in shrinking the squared norm of the weights, and this tends
to reduce overfitting.
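The dropout example above can be sketched in a few lines. Each element is dropped independently with probability p. The rescaling by 1 / (1 - p) is an addition not described in the text: it is the common "inverted dropout" convention, which keeps the expected activation unchanged so that nothing needs to be rescaled at inference time. The function name and signature are hypothetical.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero each element independently with probability p during training."""
    if not training or p == 0.0:
        return x                          # no nodes are dropped at inference time
    mask = rng.random(x.shape) >= p       # True means "keep", with probability 1 - p
    return x * mask / (1.0 - p)           # assumption: inverted-dropout rescaling

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
dropped = dropout(x, p=0.2, rng=rng)

# With 1000 hidden activations and p = 0.5, roughly half are zeroed per call.
hidden = np.ones(1000)
n_dropped = int(np.sum(dropout(hidden, p=0.5, rng=rng) == 0))
```

Because the mask is random, the number of dropped nodes varies from call to call; only its average equals p times the layer size.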
DROPCONNECT
• DropConnect works similarly to dropout, except that we disable individual
weights (i.e., set them to zero) instead of nodes, so a node can remain
partially active.
• DropConnect works by randomly setting some of the weights to zero
during training. This has the effect of “dropping out” some of the
connections between neurons.
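The difference from dropout can be made concrete with a sketch of a dense layer whose individual weights, rather than whole nodes, are masked during training. The function name, the layer shapes and the all-ones values are assumptions chosen so the effect is easy to read; inference-time handling is left out.

```python
import numpy as np

def dropconnect_forward(x, weights, bias, p, rng):
    """Dense layer in which each weight is set to zero with probability p."""
    mask = rng.random(weights.shape) >= p      # one keep/drop decision per weight
    return x @ (weights * mask) + bias, mask

rng = np.random.default_rng(0)
x = np.ones((1, 4))           # one example with 4 input features
weights = np.ones((4, 3))     # 4 inputs connected to 3 output nodes
bias = np.zeros(3)

output, mask = dropconnect_forward(x, weights, bias, p=0.5, rng=rng)
```

With all-ones inputs and weights, each output equals the number of connections that survived for that node, so a node can stay partially active instead of being removed entirely.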
BATCH NORMALIZATION
• One of the most common problems data science professionals face is
over-fitting.
• A standard remedy for this problem is regularization.
• Regularization techniques help to improve a model and can allow it to
converge faster. We have several such tools at our disposal, among them
early stopping, dropout, weight initialization techniques, and batch
normalization. Regularization helps prevent over-fitting of the model, and
the learning process becomes more efficient.
• Normalization is a data pre-processing tool used to bring numerical data
to a common scale without distorting its shape.
• Batch normalization is a method for making neural networks train faster and
more stably by adding extra layers to a deep neural network. The new layer
standardizes and normalizes the input it receives from the previous layer.
• What Is Normalization?
• Normalization is a data preprocessing technique used to adjust the values of
features in a dataset to a common scale. This is done to facilitate data
analysis and modeling, and to reduce the impact of different scales on the
accuracy of machine learning models.
• Normalization is a scaling technique in which values are shifted and rescaled
so that they end up ranging between 0 and 1. It is also known as Min-Max
scaling.
• Here’s the formula for normalization:
• X’ = (X - Xmin) / (Xmax - Xmin)
• Here, Xmax and Xmin are the maximum and the minimum values of the
feature, respectively.
• When the value of X is the minimum value in the column, the numerator
will be 0, and hence X’ is 0.
• On the other hand, when the value of X is the maximum value in the
column, the numerator is equal to the denominator, and thus the value of X’
is 1.
• If the value of X is between the minimum and the maximum value, then the
value of X’ is between 0 and 1.
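Min-Max scaling as described above can be sketched for a single feature column. The function name and the example values are hypothetical, and the guard for a constant column is an added assumption to avoid dividing by zero.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a 1-D feature column to the range [0, 1]."""
    x_min, x_max = column.min(), column.max()
    if x_max == x_min:                       # constant column: nothing to rescale
        return np.zeros_like(column, dtype=float)
    return (column - x_min) / (x_max - x_min)

ages = np.array([20.0, 30.0, 40.0, 60.0])    # hypothetical feature values
scaled = min_max_scale(ages)                 # minimum maps to 0, maximum maps to 1
```

The minimum (20) maps to 0, the maximum (60) maps to 1, and the values in between land proportionally inside that range.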
• A neural network is typically trained on groups of input examples called
batches. Similarly, the normalization in batch normalization is computed
over a batch, not over a single input.
• BATCH NORMALIZATION
• It is a two-step process. First, the input is normalized; then rescaling and
offsetting are performed.
• Normalization of the Input
• Normalization is the process of transforming the data to have mean zero
and standard deviation one. In this step we take the batch of activations
from layer h and first calculate their mean:
• μ = (1/m) Σ_i h_i
• Here, m is the number of examples in the batch and h_i is the activation of
a given neuron of layer h for the i-th example; the mean is computed
separately for each neuron.
• Once we have the mean, the next step is to calculate the standard
deviation of the hidden activations:
• σ = sqrt( (1/m) Σ_i (h_i - μ)^2 )
• With the mean and the standard deviation ready, we normalize the hidden
activations using these values. For this, we subtract the mean from each
input and divide the result by the sum of the standard deviation and the
smoothing term (ε):
• h_i(norm) = (h_i - μ) / (σ + ε)
• The smoothing term (ε) ensures numerical stability within the operation by
preventing a division by zero.
• Rescaling and Offsetting
• In the final operation, the re-scaling and offsetting of the input take place.
Here two components of the BN algorithm come into the picture, γ (gamma)
and β (beta). These parameters are used for re-scaling (γ) and shifting (β)
the vector containing the values from the previous operation:
• h_i(out) = γ · h_i(norm) + β
• These two are learnable parameters: during training, the network adjusts γ
and β together with its weights, so that each batch is normalized to the
scale and offset that work best for the following layer.
• ADVANTAGE
• Speed up training
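The two steps of batch normalization (normalize, then rescale and shift) can be sketched as a training-time forward pass. This sketch assumes that the statistics are computed per feature across the examples of a batch, as in the standard formulation, and it follows the text in dividing by σ + ε; many implementations divide by sqrt(σ² + ε) instead. The function name and shapes are hypothetical, and the running statistics used at inference time are omitted.

```python
import numpy as np

def batch_norm_forward(h, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale (gamma) and shift (beta).

    h has shape (batch_size, n_features); gamma and beta have shape (n_features,).
    """
    mu = h.mean(axis=0)                   # per-feature mean over the batch
    sigma = h.std(axis=0)                 # per-feature standard deviation
    h_norm = (h - mu) / (sigma + eps)     # step 1: zero mean, unit standard deviation
    return gamma * h_norm + beta          # step 2: learnable rescaling and offset

rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=2.0, size=(64, 4))    # hypothetical hidden activations

normalized = batch_norm_forward(h, gamma=np.ones(4), beta=np.zeros(4))
rescaled = batch_norm_forward(h, gamma=np.full(4, 2.0), beta=np.full(4, 1.0))
```

With γ = 1 and β = 0 the output has approximately zero mean and unit standard deviation per feature; with γ = 2 and β = 1 it has mean about 1 and standard deviation about 2, showing how the learnable parameters restore a chosen scale and offset.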
