NNDL Notes

The document discusses various regularization techniques in deep learning to prevent overfitting and improve model performance, including L1 and L2 regularization, dropout, early stopping, and data augmentation. It explains the concepts of underfitting and overfitting, along with their causes and solutions, emphasizing the balance needed for a good fit in statistical models. Additionally, it covers the importance of bias and variance in machine learning, and how different regularization methods can help in achieving better generalization on unseen data.

Unit 4

Regularization for Deep Learning: Parameter Norm Penalties, Norm Penalties as Constrained Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise Robustness, Semi-Supervised Learning, Multi-Task Learning, Early Stopping, Parameter Tying and Parameter Sharing, Sparse Representations, Bagging and Other Ensemble Methods, Dropout, Adversarial Training, Tangent Distance, Tangent Prop and Manifold Tangent Classifier

Regularization is a set of techniques that can prevent overfitting in neural networks and thus
improve the accuracy of a Deep Learning model when facing completely new data from the
problem domain.

Regularization is a technique used in machine learning and deep learning to prevent overfitting and improve the generalization performance of a model. It involves adding a penalty term to the loss function during training. This penalty discourages the model from becoming too complex or having large parameter values, which helps in controlling the model's ability to fit noise in the training data. Regularization methods include L1 and L2 regularization, dropout, early stopping, and more. By applying regularization, models become more robust and better at making accurate predictions on unseen data.

Underfitting in Machine Learning


A statistical model or a machine learning algorithm is said to underfit when the model is too simple to capture the complexities of the data. It represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and testing data. In simple terms, an underfit model's predictions are inaccurate, especially when applied to new, unseen examples. Underfitting mainly happens when we use a very simple model with overly simplified assumptions. To address the underfitting problem, we need to use more complex models, with enhanced feature representation and less regularization.
Note: An underfitting model has high bias and low variance.

Bias and Variance in Machine Learning


 Bias: Bias refers to the error due to overly simplistic assumptions in the learning algorithm. These assumptions make the model easier to comprehend and learn but may not capture the underlying complexities of the data. Bias is the error due to the model's inability to represent the true relationship between input and output accurately. When a model performs poorly on both the training and testing data, it has high bias because the model is too simple, indicating underfitting.
 Variance: Variance, on the other hand, is the error due to the model's sensitivity to fluctuations in the training data. It is the variability of the model's predictions across different instances of training data. High variance occurs when a model learns the training data's noise and random fluctuations rather than the underlying pattern. As a result, the model performs well on the training data but poorly on the testing data, indicating overfitting.

Reasons for Underfitting


1. The model is too simple, so it may not be capable of representing the complexities in the data.
2. The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
3. The size of the training dataset is not large enough.
4. Excessive regularization is used to prevent overfitting, which constrains the model from capturing the data well.

Techniques to Reduce Underfitting


1. Increase model complexity.
2. Increase the number of features, performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better
results.

Example: An epoch is when all the training data is used at once and is defined as the total number of iterations of all the training data in one cycle for training the machine learning model. Another way to define an epoch is the number of passes a training dataset takes around an algorithm.

Overfitting in Machine Learning

A statistical model is said to be overfitted when the model does not make accurate predictions on testing data. When a model is trained on so much data that it starts learning the noise and inaccurate entries in the data set, the result is high variance. The model then fails to categorize the data correctly because of too many details and noise. Common causes of overfitting are non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. A solution to avoid overfitting is to use a linear algorithm if we have linear data, or to use parameters such as the maximal depth if we are using decision trees.

Reasons for Overfitting:


1. High variance and low bias.
2. The model is too complex.
3. The size of the training data is too small.
Techniques to Reduce Overfitting
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (have an eye over the loss over the training
period as soon as loss begins to increase stop training).
4. Ridge Regularization and Lasso Regularization.
5. Use dropout for neural networks to tackle overfitting.

Good Fit in a Statistical Model


Ideally, the case when the model makes predictions with 0 error is said to be a good fit on the data. This situation is achievable at a spot between overfitting and underfitting. In order to understand it, we have to look at the performance of our model over time, while it is learning from the training dataset.
With the passage of time, our model keeps on learning, and thus the error of the model on the training and testing data keeps on decreasing. If it learns for too long, the model becomes more prone to overfitting due to the presence of noise and less useful details, and the performance of the model decreases. In order to get a good fit, we stop at a point just before the error starts increasing. At this point, the model is said to have good skill on the training dataset as well as on our unseen testing dataset.

Parameter norm Penalties

Parameter Norm Penalties are regularization methods that apply a penalty to the norm of
parameters in the objective function of a neural network.
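In the standard formulation (following the Goodfellow et al. textbook), the regularized objective is the unregularized objective plus a weighted penalty term:

$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(\theta)$$

where $\alpha \in [0, \infty)$ is a hyperparameter that weights the relative contribution of the norm penalty $\Omega(\theta)$.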

Different Regularization Techniques in Deep Learning

Now that we have an understanding of how regularization helps in reducing overfitting, we'll learn a few different techniques in order to apply regularization in deep learning.

Lasso Regression

A regression model which uses the L1 regularization technique is called LASSO (Least Absolute Shrinkage and Selection Operator) regression. Lasso regression adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function (L). Lasso regression also helps us achieve feature selection by penalizing the weights to be approximately equal to zero if a feature does not serve any purpose in the model.

$$\text{Cost} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{m} |w_j|$$

where,
 m – Number of Features
 n – Number of Examples
 y_i – Actual Target Value
 y_i(hat) – Predicted Target Value

Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge
regression. Ridge regression adds the “squared magnitude” of the coefficient as a penalty
term to the loss function(L).
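Using the same notation as the Lasso cost above, a hedged reconstruction of the Ridge cost (the squared penalty replaces the absolute-value penalty) is:

$$\text{Cost} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{m} w_j^2$$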

L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general cost
function by adding another term known as the regularization term.

Cost function = Loss (say, binary cross entropy) + Regularization term


Due to the addition of this regularization term, the values of weight matrices decrease
because it assumes that a neural network with smaller weight matrices leads to simpler
models. Therefore, it will also reduce overfitting to quite an extent.

However, this regularization term differs in L1 and L2.

In L2, we have:

$$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \|w\|^2$$

Here, lambda is the regularization parameter. It is the hyperparameter whose value is


optimized for better results. L2 regularization is also known as weight decay as it forces the
weights to decay towards zero (but not exactly zero).

In L1, we have:

$$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \|w\|$$

In this, we penalize the absolute value of the weights. Unlike L2, the weights may be
reduced to zero here. Hence, it is very useful when we are trying to compress our model.
Otherwise, we usually prefer L2 over it.

In Keras, we can directly apply regularization to any layer using the regularizers module. Below, an L2 regularizer is applied to a dense layer having 500 neurons and a ReLU activation function.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras import regularizers

# creating sequential model
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding="same", activation="relu", input_shape=(50, 50, 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Flatten())
# L2 regularizer on the dense layer
model.add(Dense(500, kernel_regularizer=regularizers.l2(0.01), activation="relu"))
model.add(Dense(2, activation="softmax"))  # 2 output layer neurons
Note: Here the value 0.01 is the value of regularization parameter, i.e., lambda, which we
need to optimize further
Similarly, we can also apply L1 regularization.
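For instance, a minimal sketch of the same dense layer with an L1 penalty instead (the 0.01 value is again an illustrative choice of lambda):

# L1 regularizer on the same dense layer
model.add(Dense(500, kernel_regularizer=regularizers.l1(0.01), activation="relu"))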

2. Norm Penalties as Constrained Optimization

We can construct a generalized Lagrangian function containing the objective function along with a set of penalties that can be increased or decreased. Suppose we want Ω(θ) < k; then we can construct the following Lagrangian equation (as given by the authors):

$$\mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha\,\big(\Omega(\theta) - k\big)$$


We get optimal θ by solving the Lagrangian. If Ω(θ) > k, then the weights need to be

compensated highly and hence, α should be large to reduce its value below k.

Likewise, if Ω(θ)<k, then the norm shouldn’t be reduced too much and hence, α should be

small. This is now similar to the parameter norm penalty regularized objective function as

both of them encourage lower values of the norm. Thus, parameter norm penalties naturally

impose a constraint, like the L²-regularization, defining a constrained L²-ball.

Larger α implies a smaller constrained region as it pushes the values really low, hence,

allowing a small radius and vice versa. The idea of constraints over penalties is important for

several reasons. Large penalties might cause non-convex optimization algorithms to get stuck

in local minima due to small values of θ, leading to the formation of so-called dead cells, as

the weights entering and leaving them are too small to have an impact.

Constraints don’t enforce the weights to be near zero, rather being confined to a constrained

region.

3. Regularized & Under-Constrained Problems

Underdetermined problems are those problems that have infinitely many solutions. A logistic regression problem on linearly separable classes that has w as a solution will also have 2w as a solution, and so on. In some machine learning problems, regularization is necessary. For example, many algorithms require the inversion of XᵀX, which might be singular. In such a case, we can use a regularized form instead: (XᵀX + αI) is guaranteed to be invertible.
Regularization can solve underdetermined problems. For example, the Moore-Penrose pseudoinverse defined earlier as

$$X^{+} = \lim_{\alpha \to 0^{+}} \left(X^\top X + \alpha I\right)^{-1} X^\top$$

can be interpreted as performing linear regression with L²-regularization.

Many linear models in machine learning, including linear regression, depend on inverting the matrix XᵀX. This is not possible whenever XᵀX is singular, which happens whenever the data-generating distribution truly has no variance in some direction, or when no variance is observed in some direction because there are fewer examples (rows of X) than input features (columns of X). In this case, many forms of regularization correspond to inverting (XᵀX + αI) instead.

Data Augmentation

The simplest way to reduce overfitting is to increase the size of the training data. In machine learning, it is often not feasible to increase the size of the training data because labeled data is too costly.

But now let's consider we are dealing with images. In this case, there are a few ways of increasing the size of the training data – rotating the image, flipping, scaling, shifting, etc. In the image below, some transformations have been done on the handwritten digits dataset. This technique is known as data augmentation. It usually provides a big improvement in the accuracy of the model, and can be considered a mandatory trick for improving our predictions.

Below is the implementation code example

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
featurewise_center=False, # set input mean to 0 over the dataset
samplewise_center=False, # set each sample mean to 0
featurewise_std_normalization=False, # divide inputs by std of the dataset
samplewise_std_normalization=False, # divide each input by its std
zca_whitening=False, # apply ZCA whitening
rotation_range=10, # randomly rotate images in the range (degrees, 0 to 180)
zoom_range = 0.1, # Randomly zoom image
width_shift_range=0.1, # randomly shift images horizontally (fraction of total width)
height_shift_range=0.1, # randomly shift images vertically (fraction of total height)
horizontal_flip=False, # randomly flip images
vertical_flip=False) # randomly flip images

datagen.fit(x_train)

Dropout
This is one of the most interesting types of regularization techniques. It also produces very good results and is consequently the most frequently used regularization technique in the field of deep learning.
To understand dropout, let's say our neural network structure is akin to the one shown in the figure.

So what does dropout do? At every iteration, it randomly selects some nodes and removes them, along with all of their incoming and outgoing connections, as shown below.

So each iteration has a different set of nodes and this results in a different set of outputs. It
can also be thought of as an ensemble technique in machine learning.

Ensemble models usually perform better than a single model as they capture more
randomness. Similarly, dropout also performs better than a normal neural network model.

The probability of dropping a node is the hyperparameter of the dropout function. As seen in the image above, dropout can be applied to both the hidden layers as well as the input layers.
Due to these reasons, dropout is usually preferred when we have a large neural network
structure in order to introduce more randomness.

In Keras, we can implement dropout using the Dropout layer. Below is the dropout implementation. A dropout rate of 0.2 (the probability of dropping a node) is introduced after the last convolutional block having 64 kernels and after the first dense layer having 500 neurons.

Example:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# creating sequential model
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding="same", activation="relu", input_shape=(50, 50, 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
# 1st dropout
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(500, activation="relu"))
# 2nd dropout
model.add(Dropout(0.2))
model.add(Dense(2, activation="softmax"))  # 2 output layer neurons
Early stopping

Early stopping is a kind of cross-validation strategy where we keep one part of the training set as the validation set. When we see that the performance on the validation set is getting worse, we immediately stop the training of the model. This is known as early stopping.

In the above image, we will stop training at the dotted line, since after that our model will start overfitting on the training data.

In Keras, we can apply early stopping using callbacks. Below is the implementation code for it. Early stopping is applied so that training will stop immediately if the monitored validation metric has not improved after 3 epochs.

from keras.callbacks import EarlyStopping

earlystop = EarlyStopping(monitor='val_acc', patience=3)
epochs = 20
batch_size = 256
Here, monitor denotes the quantity that needs to be monitored, and 'val_acc' denotes the validation accuracy.
Patience denotes the number of epochs with no further improvement after which the training will be stopped. For better understanding, let's take a look at the above image again. After the dotted line, each epoch will result in a higher value of validation error. Therefore, 3 epochs after the dotted line (since our patience is equal to 3), our model will stop, because no further improvement is seen.
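As a hedged usage sketch (the names x_train, y_train, x_val, y_val are assumptions, not defined in these notes), the callback is passed to model.fit:

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=epochs, batch_size=batch_size,
          callbacks=[earlystop])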

Noise Robustness

Noise applied to the inputs is a form of data augmentation. For some models, the addition of noise with extremely small variance at the input is equivalent to imposing a penalty on the norm of the weights.
Noise applied to the hidden units: noise injection can be much more powerful than simply shrinking the parameters. Noise applied to hidden units is so important that dropout is the main development of this approach.

Training a neural network with a small dataset can cause the network to memorize all
training examples, in turn leading to overfitting and poor performance on a holdout dataset.
One approach to making the input space smoother and easier to learn is to add noise to
inputs during training.
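As a hedged illustration (not part of the original notes), Keras provides a GaussianNoise layer that adds zero-mean noise to its inputs during training only; the 0.1 standard deviation and layer sizes below are assumed values:

from keras.models import Sequential
from keras.layers import GaussianNoise, Dense

noisy_model = Sequential()
noisy_model.add(GaussianNoise(0.1, input_shape=(20,)))  # inject input noise while training
noisy_model.add(Dense(64, activation="relu"))
noisy_model.add(Dense(1))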

 Small datasets can make learning challenging for neural nets and the examples can be
memorized.
 Adding noise during training can make the training process more robust and reduce
generalization error.
 Noise is traditionally added to the inputs, but can also be added to weights, gradients, and
even activation functions.

Random noise can be added to other parts of the network during training. Some examples include:

 Add noise to activations, i.e. the outputs of each layer.


 Add noise to weights, i.e. an alternative to the inputs.
 Add noise to the gradients, i.e. the direction to update weights.
 Add noise to the outputs, i.e. the labels or target variables.
The addition of noise to the layer activations allows noise to be used at any point in the
network. This can be beneficial for very deep networks. Noise can be added to the layer
outputs themselves, but this is more likely achieved via the use of a noisy activation
function.

The addition of noise to weights allows the approach to be used throughout the network in
a consistent way instead of adding noise to inputs and layer activations. This is particularly
useful in recurrent neural networks.
The addition of noise to gradients focuses more on improving the robustness of the
optimization process itself rather than the structure of the input domain. The amount of
noise can start high at the beginning of training and decrease over time, much like a
decaying learning rate. This approach has proven to be an effective method for very deep
networks and for a variety of different network types

Adding noise to the activations, weights, or gradients all provide a more generic approach to
adding noise that is invariant to the types of input variables provided to the model.

If the problem domain is believed or expected to have mislabeled examples, then the addition of noise to the class labels can improve the model's robustness to this type of error, although it can also easily derail the learning process.

Adding noise to a continuous target variable in the case of regression or time series
forecasting is much like the addition of noise to the input variables and may be a better use
case.

Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that falls in between supervised
and unsupervised learning. It is a method that uses a small amount of labeled data and a
large amount of unlabeled data to train a model. The goal of semi-supervised learning is to
learn a function that can accurately predict the output variable based on the input
variables, similar to supervised learning. However, unlike supervised learning, the
algorithm is trained on a dataset that contains both labeled and unlabeled data.
Semi-supervised learning is particularly useful when there is a large amount of unlabeled
data available, but it’s too expensive or difficult to label all of it.

Examples of Semi-Supervised Learning


 Text classification: In text classification, the goal is to classify a given text into one or
more predefined categories. Semi-supervised learning can be used to train a text
classification model using a small amount of labeled data and a large amount of
unlabeled text data.
 Image classification: In image classification, the goal is to classify a given image into
one or more predefined categories. Semi-supervised learning can be used to train an
image classification model using a small amount of labeled data and a large amount of
unlabeled image data.
 Anomaly detection: In anomaly detection, the goal is to detect patterns or observations that are unusual or different from the norm.
Applications of Semi-Supervised Learning
1. Speech Analysis: Since labeling audio files is a very intensive task, Semi-Supervised
learning is a very natural approach to solve this problem.
2. Internet Content Classification: Labeling each webpage is an impractical and
unfeasible process and thus uses Semi-Supervised learning algorithms. Even the
Google search algorithm uses a variant of Semi-Supervised learning to rank the
relevance of a webpage for a given query.
3. Protein Sequence Classification: Since DNA strands are typically very large in size, the rise of Semi-Supervised learning has been prominent in this field.

Disadvantages of Semi-Supervised Learning


The most basic disadvantage of any Supervised Learning algorithm is that the dataset has
to be hand-labeled either by a Machine Learning Engineer or a Data Scientist. This is a
very costly process, especially when dealing with large volumes of data. The most basic
disadvantage of any Unsupervised Learning is that its application spectrum is limited.
To counter these disadvantages, the concept of Semi-Supervised Learning was
introduced. In this type of learning, the algorithm is trained upon a combination of labeled
and unlabelled data. Typically, this combination will contain a very small amount of
labeled data and a very large amount of unlabelled data. The basic procedure involved is
that first, the programmer will cluster similar data using an unsupervised learning
algorithm and then use the existing labeled data to label the rest of the unlabelled data.

Multi-Task Learning

Multi-Task Learning (MTL) is a type of machine learning technique where a model is


trained to perform multiple tasks simultaneously. In deep learning, MTL refers to training
a neural network to perform multiple tasks by sharing some of the network’s layers and
parameters across tasks.
In MTL, the goal is to improve the generalization performance of the model by leveraging
the information shared across tasks. By sharing some of the network’s parameters, the
model can learn a more efficient and compact representation of the data, which can be
beneficial when the tasks are related or have some commonalities.

Hard Parameter Sharing – A common hidden layer is used for all tasks but several task
specific layers are kept intact towards the end of the model. This technique is very useful
as by learning a representation for various tasks by a common hidden layer, we reduce the
risk of overfitting.
Soft Parameter Sharing – Each model has their own sets of weights and biases and
the distance between these parameters in different models is regularized so that
the parameters become similar and can represent all the tasks.
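A minimal sketch of hard parameter sharing with the Keras functional API (the layer sizes and the two task heads below are illustrative assumptions, not taken from these notes):

from keras.models import Model
from keras.layers import Input, Dense

inputs = Input(shape=(100,))
# shared hidden layers (hard parameter sharing)
shared = Dense(128, activation="relu")(inputs)
shared = Dense(64, activation="relu")(shared)
# task-specific heads
task_a = Dense(10, activation="softmax", name="task_a")(shared)  # e.g. a classification task
task_b = Dense(1, name="task_b")(shared)                         # e.g. a regression task

mtl_model = Model(inputs=inputs, outputs=[task_a, task_b])
mtl_model.compile(optimizer="adam",
                  loss={"task_a": "categorical_crossentropy", "task_b": "mse"})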

Assumptions and Considerations –


Using MTL to share knowledge among tasks is very useful only when the tasks are very similar; when this assumption is violated, the performance will decline significantly.
Applications: MTL techniques have found various uses, some of the major applications
are-
 Object detection and Facial recognition
 Self Driving Cars: Pedestrians, stop signs and other obstacles can be detected together
 Multi-domain collaborative filtering for web applications
 Stock Prediction
 Language Modelling and other NLP applications

Multi-Task Learning (MTL) for deep learning important observations :


1. Task relatedness: MTL is most effective when the tasks are related or have some
commonalities, such as natural language processing, computer vision, and healthcare.
2. Data limitation: MTL can be useful when the data is limited, as it allows the model to
leverage the information shared across tasks to improve the generalization
performance.
3. Shared feature extractor: A common approach in MTL is to use a shared feature
extractor, which is a part of the network that is shared across tasks and is used to
extract features from the input data.
4. Task-specific heads: Task-specific heads are used to make predictions for each task and
are typically connected to the shared feature extractor.
5. Shared decision-making layer: another approach is to use a shared decision-making
layer, where the decision-making layer is shared across tasks, and the task-specific
layers are connected to the shared decision-making layer.

Parameter Tying
Consider two models performing the same classification task (with the same set of classes), but with somewhat different input distributions.
 Model A has the parameters w(A).
 Model B has the parameters w(B).

The two models map the input to two different but related outputs.

Assume the tasks are comparable enough (possibly with similar input and output distributions) that the model parameters should be near to each other: each w(A)_i should be close to w(B)_i.

We can take advantage of this information through regularization, by applying a parameter norm penalty of the form

$$\Omega\big(w^{(A)}, w^{(B)}\big) = \big\|w^{(A)} - w^{(B)}\big\|_2^2$$

We utilised an L2 penalty here, but there are other options.

Parameter Sharing
The parameters of one model, trained as a classifier in a supervised paradigm, were
regularised to be close to the parameters of another model, trained in an unsupervised
paradigm, using this method (to capture the distribution of the observed input data).
Many of the parameters in the classifier model might be linked with similar parameters in
the unsupervised model thanks to the designs.

While a parameter norm penalty is one technique to require sets of parameters to be


equal, constraints are a more prevalent way to regularise parameters to be close to one
another. Because we view the numerous models or model components as sharing a
unique set of parameters, this form of regularisation is commonly referred to as
parameter sharing. The fact that only a subset of the parameters (the unique set) needs to
be retained in memory is a significant advantage of parameter sharing over regularising
the parameters to be close (through a norm penalty). This can result in a large reduction in
the memory footprint of certain models, such as the convolutional neural network.

Example : Convolutional neural networks (CNNs) used in computer vision are by far the
most widespread and extensive usage of parameter sharing. Many statistical features of
natural images are translation insensitive. A shot of a cat, for example, can be translated
one pixel to the right and still be a shot of a cat. By sharing parameters across several
picture locations, CNNs take this property into account. Different locations in the input are
computed with the same feature (a hidden unit with the same weights).
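To see why sharing reduces the memory footprint, compare the parameter counts of a convolutional layer (one shared kernel reused at every location) with a dense layer on the same input; this is a hedged sketch with illustrative shapes:

from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten

shared = Sequential([Conv2D(16, kernel_size=3, padding="same", input_shape=(50, 50, 3))])
unshared = Sequential([Flatten(input_shape=(50, 50, 3)), Dense(16)])

shared.summary()    # Conv2D: 3*3*3*16 + 16 = 448 parameters, independent of image size
unshared.summary()  # Dense: 50*50*3*16 + 16 = 120,016 parameters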

Sparse Representations

Sparse representation (SR) is used to represent data with as few atoms as possible in a given overcomplete dictionary. By using SR, we can concisely represent the data and easily extract the valuable information from it.

Sparse representations classification (SRC) is a powerful technique for pixelwise


classification of images and it is increasingly being used for a wide variety of image analysis
tasks. The method uses sparse representation and learned redundant dictionaries to classify
image pixels.

Sparse representation attracts great attention as it can significantly save computing resources and find the characteristics of data in a low-dimensional space. Thus, it can be widely applied in engineering fields such as dictionary learning, signal reconstruction, and image processing. As real-world data becomes more diverse and complex, it becomes hard to completely reveal the intrinsic structure of data with commonly used approaches. This has led to the exploration of more practicable representation models and efficient optimization approaches. New formulations such as deep sparse representation, graph-based sparse representation, geometry-guided sparse representation, and group sparse representation have achieved remarkable success.

The terms "sparse" and "dense" are commonly used to describe the distribution of zero and non-zero array members in machine learning (e.g., in a vector or matrix). Sparse matrices are those that primarily consist of zeros, while dense matrices have a large number of nonzero entries.

Machine learning makes use of sparse and dense representations due to their usefulness in efficient data representation. While dense representations are useful for capturing intricate interactions between data points, sparse representations can help reduce the size of a dataset.

 Sparse representations have the potential to be more resilient to noise and


produce more interpretable outcomes. For calculations, dense representations are
typically more effective since they can be processed more quickly. On top of that,
dense representations are useful for tasks like classification and regression
because they can capture intricate connections between data points.
 Sparse representations are helpful for reducing the dimensionality of the data in
tasks like natural language processing and picture recognition. Further, sparse
representations can be utilized to capture only the most crucial elements of the
data, which can greatly cut down on the time needed to train a model.

 Dense representations are able to capture complicated interactions between data


points, they are frequently employed in machine learning and can be especially
helpful for tasks like classification and regression. Because of their increased
computational efficiency, dense representations can also shorten the time it takes
to train a model.

A matrix is a two-dimensional data object made of m rows and n columns, therefore


having total m x n values. If most of the elements of the matrix have 0 value, then it is
called a sparse matrix.
Why use a sparse matrix instead of a simple matrix?
 Storage: There are fewer non-zero elements than zeros, and thus less memory is needed to store only those elements.
 Computing time: Computing time can be saved by logically designing a data structure that traverses only the non-zero elements.

Sparse matrix representations can be done in many ways; the following are two common representations:
1. Array representation
2. Linked list representation

Example -

Let's understand the array representation of sparse matrix with the help of the example
given below -

Consider the sparse matrix -

In the above figure, we can observe a 5x4 sparse matrix containing 7 non-zero elements and
13 zero elements. The above matrix occupies 5x4 = 20 memory space. Increasing the size of
matrix will increase the wastage space.

The tabular representation of the above matrix is given below -


In the above structure, first column represents the rows, the second column represents the
columns, and the third column represents the non-zero value. The first row of the table
represents the triplets. The first triplet represents that the value 4 is stored at 0th row and
1st column. Similarly, the second triplet represents that the value 5 is stored at the 0th row
and 3rd column. In a similar manner, all triplets represent the stored location of the non-
zero elements in the matrix.

The size of the table depends upon the total number of non-zero elements in the given
sparse matrix. Above table occupies 8x3 = 24 memory space which is more than the space
occupied by the sparse matrix. So, what's the benefit of using the sparse matrix? Consider
the case if the matrix is 8*8 and there are only 8 non-zero elements in the matrix, then the
space occupied by the sparse matrix would be 8*8 = 64, whereas the space occupied by the
table represented using triplets would be 8*3 = 24.
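A hedged sketch in Python of the array (triplet) representation; the matrix below is an assumed example, not the matrix from the figure referenced above:

import numpy as np

matrix = np.array([[0, 4, 0, 5],
                   [0, 0, 3, 0],
                   [0, 0, 0, 0],
                   [9, 0, 0, 0],
                   [0, 0, 2, 0]])

# collect (row, column, value) triplets for the non-zero elements only
triplets = [(r, c, int(matrix[r, c]))
            for r in range(matrix.shape[0])
            for c in range(matrix.shape[1])
            if matrix[r, c] != 0]

for row, col, value in triplets:
    print(row, col, value)   # e.g. "0 1 4" means value 4 at row 0, column 1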

Example -

Let's understand the linked list representation of sparse matrix with the help of the example
given below -

Consider the sparse matrix


In the above figure, we can observe a 4x4 sparse matrix containing 5 non-zero elements and
11 zero elements. Above matrix occupies 4x4 = 16 memory space. Increasing the size of
matrix will increase the wastage space.

The linked list representation of the above matrix is given below -

In the above figure, the sparse matrix is represented in the linked list form. In the node, the
first field represents the index of the row, the second field represents the index of the
column, the third field represents the value, and the fourth field contains the address of the
next node.

In the above figure, the first field of the first node of the linked list contains 0, which means
0th row, the second field contains 2, which means 2nd column, and the third field contains 1
that is the non-zero element. So, the first node represents that element 1 is stored at the
0th row-2nd column in the given sparse matrix. In a similar manner, all of the nodes represent
the non-zero elements of the sparse matrix.

Sparse Coding representation in Neural Networks

sparse code follows the more all-encompassing idea of neural code. Consider the case
when you have binary neurons. So, basically:

 The neural networks will get some inputs and deliver outputs
 Some neurons in the neural network will be frequently activated while others won’t
be activated at all to calculate the outputs
 The average activity ratio refers to the number of activations on some data, whereas
the neural code is the observation of those activations for a specific input
 Neural coding is the process of instructing your neurons to produce a reliable neural
code

Now that we know what a neural code is, we can speculate on what it may be like. Then,
data will be encoded using a sparse code while taking into consideration the following
scenarios:

 No neurons are even activated


 One neuron alone is activated
 Half of the neurons are active
These are the methods which are being followed to represent image and its classifications

Ensemble Learning Methods: Bagging, Boosting

Ensemble learning is a machine learning technique combining multiple individual models to


create a stronger, more accurate predictive model. By leveraging the diverse strengths of
different models, ensemble learning aims to mitigate errors, enhance performance, and
increase the overall robustness of predictions, leading to improved results across various
tasks in machine learning and neural networks .

Bagging or Bootstrap Aggregating is an ensemble learning method that is used to reduce the
error by training homogeneous weak learners on different random samples from the
training set, in parallel. The results of these base learners are then combined through voting
or averaging approach to produce an ensemble model that is more robust and accurate.

Bagging mainly focuses on obtaining an ensemble model with lower variance than the
individual base models composing it. Hence, bagging techniques help avoid the overfitting
of the model.
Benefits of Bagging
 Reduce Overfitting
 Improve Accuracy
 Handles Unstable Models
Note: Random Forest Algorithm is one of the most common Bagging Algorithm.

Steps of Bagging Technique


 Randomly select multiple bootstrap samples from the training data with replacement and
train a separate model on each sample.
 For classification, combine predictions using majority voting. For regression, average the
predictions.
 Assess the ensemble’s performance on test data and use the aggregated models for
predictions on new data.
 If needed, retrain the ensemble with new data or integrate new models into the existing
ensemble.
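A hedged sketch of these steps using scikit-learn (the library and the toy dataset are assumptions; the notes themselves do not specify an implementation):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 trees, each trained on a bootstrap sample; predictions combined by majority vote
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))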
Example of Bagging and boosting
The main idea behind ensemble learning is the usage of multiple algorithms and models that
are used together for the same task. While single models use only one algorithm to create
prediction models, bagging and boosting methods aim to combine several of those to
achieve better prediction with higher consistency compared to individual learnings.

Image classification

Suppose a collection of images, each accompanied by a categorical label corresponding to the kind of animal, is available for the purpose of training a model. In a traditional modeling approach, we would try several techniques and calculate the accuracy to choose one over the other. Imagine we used logistic regression, decision tree, and support vector machine models here, which perform differently on the given data set.

In the above example, suppose a specific record is predicted as a dog by the logistic regression and decision tree models, while a support vector machine identifies it as a cat. As the various models have their distinct advantages and disadvantages for particular records, it is the key idea of ensemble learning to combine all three models instead of selecting only the one approach that showed the highest accuracy.

The procedure is called aggregation or voting: it combines the predictions of all underlying models to come up with one prediction that is assumed to be more precise than any sub-model standing alone.


Boosting is an ensemble learning method that involves training homogenous weak
learners sequentially such that a base model depends on the previously fitted base models.
All these base learners are then combined in a very adaptive way to obtain an ensemble
model.
In boosting, the ensemble model is the weighted sum of all constituent base learners. There are two main meta-algorithms in boosting that differentiate how the base models are aggregated:
 Adaptive Boosting (AdaBoost)
 Gradient Boosting (of which XGBoost is a popular implementation)

Benefits of Boosting Techniques


 High Accuracy

 Adaptive Learning
 Reduces Bias
 Flexibility
How is Boosting Model Trained to Make Predictions
 Samples generated from the training set are assigned the same weight to start with.
These samples are used to train a homogeneous weak learner or base model.
 The prediction error for a sample is calculated – the greater the error, the weight of the
sample increases. Hence, the sample becomes more important for training the next base
model.
 The individual learner is weighted too – a learner that does well on its predictions gets a higher weight assigned to it. So, a model that outputs good predictions will have a higher say in the final decision.
 The weighted data is then passed on to the following base model, and steps 2 and step 3
are repeated until the data is fitted well enough to reduce the error below a certain
threshold.
 When new data is fed into the boosting model, it is passed through all individual base
models, and each model makes its own weighted prediction.
 Weight of these models is used to generate the final prediction. The predictions are
scaled and aggregated to produce a final prediction.
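A hedged AdaBoost sketch in the same style, reusing the toy data from the bagging example above (again, scikit-learn is an assumption):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# weak learners are depth-1 trees; each new learner focuses on previously misclassified samples
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)
print(boosting.score(X_test, y_test))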
Key Difference Between Bagging and Boosting
 The bagging technique combines multiple models trained on different subsets of data,
whereas boosting trains models sequentially, focusing on the error made by the previous
model.
 Bagging is best for high variance and low bias models while boosting is effective when the
model must be adaptive to errors, suitable for bias and variance errors.
 Generally, boosting techniques are not prone to overfitting. Still, it can be if the number
of models or iterations is high, whereas the Bagging technique is less prone to overfitting.
 Bagging improves accuracy by reducing variance, whereas boosting achieves accuracy by
reducing bias and variance.
 Boosting is suitable for bias and variance, while bagging is suitable for high-variance and
low-bias models.

About bias and variance used in bagging and boosting

Bias: While making predictions, a difference occurs between the values predicted by the model and the actual/expected values; this difference is known as bias error, or error due to bias.
o Low Bias: A low-bias model makes fewer assumptions about the form of the target function.
o High Bias: A model with high bias makes more assumptions, and becomes unable to capture the important features of our dataset. A high-bias model also cannot perform well on new data.
Variance: Variance specifies the amount of variation in the prediction if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between input and output variables. Variance errors are either low variance or high variance.

o Low variance means there is a small variation in the prediction of the target function
with changes in the training data set. At the same time, High variance shows a large
variation in the prediction of the target function with changes in the training dataset.

Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

Tangent propagation is a way of regularizing neural nets. It encourages the representation


to be invariant by penalizing large changes in the representation when small
transformations are applied to the inputs.

It combines this prior knowledge with observed training data, by minimizing an objective
function that measures both the network's error with respect to the training example values
(fitting the data) and its error with respect to the desired derivatives (fitting the prior
knowledge).

Tangent propagation is closely related to dataset augmentation. In both cases, the user of
the algorithm encodes his or her prior knowledge of the task by specifying a set of
transformations that should not alter the output of the network.

The difference is that in the case of dataset augmentation, the network is explicitly trained
to correctly classify distinct inputs that were created by applying more than an infinitesimal
amount of these transformations.

tangent propagation does not require explicitly visiting a new input point. Instead, it
analytically regularizes the model to resist perturbation in the directions corresponding to
the specified transformation. While this analytical approach is intellectually elegant,

it has two major drawbacks. First, it only regularizes the model to resist infinitesimal
perturbation. Explicit dataset augmentation confers resistance to larger perturbations(
means changes in datasets) Second, the infinitesimal approach poses difficulties for models
based on rectified linear units. These models can only shrink their derivatives by turning
units off or shrinking their weights.
They are not able to shrink their derivatives by saturating at a high value with large weights,
as sigmoid or tanh units can. Dataset augmentation works well with rectified linear units
because different subsets of rectified units can activate for different transformed versions of
each original input. Tangent propagation is also related to double backprop (Drucker and
LeCun, 1992) and adversarial training

The TANGENTPROP Algorithm: TANGENTPROP (Simard et al. 1992) accommodates domain knowledge expressed as derivatives of the target function with respect to transformations of its inputs. Consider a learning task involving an instance space X and target function f.

The TANGENTPROP algorithm assumes various training derivatives of the target function are also provided. For example, if each instance x_i is described by a single real value, then each training example may be of the form (x_i, f(x_i), ∂f/∂x |_{x_i}). Here ∂f/∂x |_{x_i} denotes the derivative of the target function f with respect to x, evaluated at the point x = x_i.

To develop an intuition for the benefits of providing training derivatives as well as training values during learning, consider the simple learning task depicted in the figure.

The task is to learn the target function f shown in the leftmost plot of the figure, based on the three training examples shown: (x1, f(x1)), (x2, f(x2)), and (x3, f(x3)).

Given these three training examples, the BACKPROPAGATION algorithm can be expected to hypothesize a smooth function, such as the function g depicted in the middle plot of the figure. The rightmost plot shows the effect of providing training derivatives, or slopes, as additional information for each training example (e.g., (x1, f(x1), ∂f/∂x |_{x1})). By fitting both the training values f(x_i) and these training derivatives ∂f/∂x |_{x_i}, the learner has a better chance to correctly generalize from the sparse training data.

To summarize, the impact of including the training derivatives is to override the usual syntactic inductive bias of BACKPROPAGATION that favors a smooth interpolation between points, replacing it with explicit input information about the required derivatives. The resulting hypothesis h shown in the rightmost plot of the figure provides a much more accurate estimate of the true target function f.
Each transformation must be of the form s_j(α, x), where α is a continuous parameter, s_j is differentiable, and s_j(0, x) = x (e.g., for a rotation of zero degrees the transformation is the identity function).

In the first plot of the figure, f(x) is the target function and x1, x2, x3 are the instances fitted to the proper hypothesis; in the second plot we can see the instances classified, and the machine learns to fit a proper hypothesis by making the necessary modifications.

For each such transformation s_j(α, x), TANGENTPROP considers the squared error between the specified training derivative and the actual derivative of the learned neural network. The modified error function is

$$E = \sum_i \left[ \big(f(x_i) - \hat{f}(x_i)\big)^2 + \mu \sum_j \left( \frac{\partial f\big(s_j(\alpha, x_i)\big)}{\partial \alpha} - \frac{\partial \hat{f}\big(s_j(\alpha, x_i)\big)}{\partial \alpha} \right)^2_{\alpha = 0} \right]$$

where μ is a constant provided by the user to determine the relative importance of fitting training values versus fitting training derivatives.

Notice that the first term in this definition of E is the original squared error of the network versus the training values, and the second term is the squared error of the network derivatives versus the training derivatives.

In the third plot of the figure, we can see that the instances are classified properly while maintaining accuracy.

An Illustrative Example
Remarks To summarize, TANGENTPROP uses prior knowledge in the form of desired
derivatives of the target function with respect to transformations of its inputs.

It combines this prior knowledge with observed training data, by minimizing an objective
function that measures both the network's error with respect to the training example values
(fitting the data) and its error with respect to the desired derivatives (fitting the prior
knowledge).
UNIT - V

Optimization for Training Deep Models: Challenges in Neural Network Optimization, Basic Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates, Approximate Second Order Methods, Optimization Strategies and Meta-Algorithms

Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language Processing

The Challenges of Optimizing Deep Learning Models

There are several types of optimization in deep learning algorithms, but the most interesting ones are focused on reducing the value of cost functions.

Some Basics of Optimization in Deep Learning Models

The core of deep learning optimization relies on trying to minimize the cost function of a model without affecting its training performance. That type of optimization problem contrasts with the general optimization problem, in which the objective is simply to minimize a specific indicator without being constrained by the performance of other elements (e.g., training).

Most optimization algorithms in deep learning are based on gradient estimations. In that context, optimization algorithms try to reduce the gradient of specific cost functions. Algorithms that use the entire training set at once are called deterministic. Other techniques that use one training example at a time have come to be known as online algorithms. Similarly, algorithms that use more than one but fewer than all of the training examples during the optimization process are known as minibatch stochastic, or simply stochastic.

The most famous method of stochastic optimization, which is also the most common optimization algorithm in deep learning solutions, is stochastic gradient descent (SGD).

Regardless of the type of optimization algorithm used, the process of optimizing a deep learning model is a careful path full of challenges.

Common Challenges in Deep Learning Optimization

There are plenty of challenges in deep learning optimization, but most of them are related to the nature of the gradient of the model. Below are some of the most common challenges in deep learning optimization that you are likely to run into:

a) Local Minima: Local minima are a permanent challenge in the optimization of any deep learning algorithm. The local minima problem arises when the gradient encounters many local minima that are different from and not correlated to a global minimum of the cost function.

b) Saddle Points: Saddle points are another reason for gradients to vanish. A saddle point is any location where all gradients of a function vanish but which is neither a global nor a local minimum.

c) Flat Regions: In deep learning optimization models, flat regions are common areas that represent both a local minimum for a sub-region and a local maximum for another. That duality often causes the gradient to get stuck.

d) Inexact Gradients: In many cases the gradient of the cost function is intractable, which forces an inexact estimation of the gradient. In these cases, the inexact gradients introduce a second layer of uncertainty in the model.

e) Local vs. Global Structures: Another very common challenge in the optimization of deep learning models is that local regions of the cost function don't correspond with its global structure, producing a misleading gradient.

Vanishing and Exploding Gradients

Deep learning networks can be problematic when gradient values shrink or grow too quickly as they pass through many layers. This can make it hard for the network to learn and remain stable.

Solution: Gradient clipping, advanced weight initialization, and skip connections help the network learn accurately and consistently.

Overfitting
Overfitting happens when a model learns too much about the training data, so it cannot make good predictions about new data. As a result, the model performs well on the training data but struggles to make accurate predictions on new, unseen data. It's essential to address overfitting by employing techniques like regularization, cross-validation, and more diverse datasets to ensure the model generalizes well to unseen examples.
Solution: Regularization techniques such as dropout, L1/L2 regularization, and early stopping help ensure our models do not simply memorize the data and instead use what they have learned to make good predictions about new data.
Data Quality
Poorly prepared input data can hinder the model during training; clean, well-prepared data enables it to learn more effectively and make accurate predictions.
Solution: Apply data augmentation techniques like rotation, translation, and flipping alongside data normalization and proper handling of missing values.
Label Noise
Training labels are sometimes incorrect, which makes it hard for the model to learn well.
Solution: Using robust loss functions can help ensure that the model is not strongly affected by label mistakes.
Imbalanced Datasets
Datasets can have too many examples of one class and too few of another. This can cause models to work poorly for the under-represented classes.
Solution: Use techniques like class weighting, oversampling, or data synthesis to ensure that all classes are adequately represented.
Computational Resource Constraints
Training deep neural networks can require a lot of computing power, especially if the model is very big.
Solution: Using multiple machines or specialized accelerators such as GPUs and TPUs can make training faster and easier.
Hyperparameter Tuning
Deep neural networks have numerous hyperparameters that require careful tuning to
achieve optimal performance.
Solution: To efficiently find the best hyperparameters, utilize automated hyperparameter optimization methods such as Bayesian optimization or genetic algorithms.
Convergence Speed
It is important to ensure a model works quickly when using lots of data and complicated
designs.
Solution: Adopt learning rate scheduling or adaptive algorithms like Adam or RMSprop to
expedite convergence.
Memory Constraints
Training large models on large datasets requires a lot of memory, which may not be available on a given machine.
Solution: Reduce memory usage by applying model quantization, using mixed-precision
training, or employing memory-efficient architectures like MobileNet or EfficientNet.
Transfer Learning and Domain Adaptation
Deep learning networks need lots of data to work well. If they don't get enough data or the
data is different, they won't work as well.
Solution: Leverage transfer learning or domain adaptation techniques to transfer knowledge
from pre-trained models or related domains.
Adversarial Attacks
Deep neural networks can be fooled by small, carefully crafted changes to their inputs that humans cannot see, causing them to give wrong answers.
Solution: Adversarial training and input preprocessing defenses can improve robustness to such attacks.
Interpretability and Explainability
Understanding the decisions made by deep neural networks is crucial in critical applications
like healthcare and autonomous driving.
Solution: Adopt techniques such as LIME (Local Interpretable Model-Agnostic Explanations)
or SHAP (SHapley Additive exPlanations) to explain model predictions.
Handling Sequential Data
Training deep neural networks on sequential data, such as time series or natural language
sequences, presents unique challenges.
Solution: Utilize specialized architectures like recurrent neural networks (RNNs) or
transformers to handle sequential data effectively.
Limited Data
Training deep neural networks with limited labeled data is a common challenge, especially
in specialized domains.
Solution: Consider semi-supervised, transfer, or active learning to make the most of
available data.

Catastrophic Forgetting
When a model forgets previously learned knowledge after training on new data, it
encounters the issue of catastrophic forgetting.
Solution: Implement techniques like elastic weight consolidation (EWC) or knowledge
distillation to retain old knowledge during continual learning.
Hardware and Deployment Constraints
Deploying trained models on devices with limited computing power can be hard.
Solution: Techniques such as pruning, quantization, and knowledge distillation make models run better on devices with limited resources.
Data Privacy and Security
Training on sensitive data requires keeping that data private and the training pipeline secure.
Solution: Employ federated learning, secure aggregation, or differential privacy techniques
to protect data and model privacy.
Long Training Times
Training deep neural networks can take hours, days, or longer, especially for very large models and datasets.
Solution: Hardware accelerators such as GPUs or TPUs, together with distributed training across multiple machines, can substantially shorten training time.
Exploding Memory Usage
Very large models may not fit in the memory of ordinary hardware, making them hard to train or deploy.
Solution: Explore memory-efficient architectures, use gradient checkpointing, or consider
model parallelism for training.
Learning Rate Scheduling
Setting an appropriate learning rate schedule can be challenging, affecting model
convergence and performance.
Solution: Use learning rate schedules (e.g. step decay, exponential decay, or warm restarts) to make training converge faster and more stably; a minimal example is sketched below.
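For example, assuming TensorFlow/Keras is available (the decay numbers below are purely illustrative), an exponentially decaying schedule can be attached to an optimizer:

import tensorflow as tf

# halve the learning rate every 10,000 optimizer steps (illustrative values)
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=10_000,
    decay_rate=0.5)

optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
print(schedule(0), schedule(10_000))   # learning rate at step 0 and after one decay period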
Local Minima and Saddle Points
The loss surface of a deep network contains many local minima, saddle points, and plateaus where optimization can stall, hurting final performance.
Solution: Strategies such as simulated annealing, momentum-based optimization, restarts, and evolutionary algorithms can help the optimizer escape such regions.
Unstable Loss Surfaces
The loss surface of a deep network is high-dimensional and highly non-convex, which makes optimization difficult and sensitive to small changes.
Solution: Utilize weight noise injection, curvature-based optimization, or geometric
methods to stabilize loss surfaces.
Ill-Conditioned Matrix

In a neural network, the weight updates in the hidden layers involve matrix computations, and the conditioning of those matrices tells us how reliable further computations and calculations will be. Formally, the condition number is a measure of how much the output value of a function can change for a small change in the input argument.

A matrix is said to be ill-conditioned if its condition number is very high. In that case, a small change in the input function or in the Hessian matrix (the Hessian matrix is a square matrix of second-order partial derivatives of a scalar function; it is of immense use in linear algebra as well as for determining points of local maxima or minima) produces outputs with high variance, which makes gradient-based training unstable.
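As a small, self-contained illustration (a sketch, not tied to any particular network), the condition number of a Hessian-like matrix can be computed with NumPy as the ratio of its largest to smallest singular value:

import numpy as np

# a toy symmetric "Hessian" whose curvature differs greatly between directions
H = np.array([[1000.0, 0.0],
              [0.0,    0.1]])

print(np.linalg.cond(H))   # 10000.0 -> ill-conditioned

# a gradient step sized for the steep direction is far too small for the
# shallow direction, so gradient-based training converges slowly and the
# updates become very sensitive to small changes in the gradients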

Basic Algorithms

Gradient Descent is an iterative optimization process that searches for an objective function's optimum value (minimum/maximum). It is one of the most widely used methods for updating a model's parameters in order to reduce a cost function.
Stochastic Gradient Descent (SGD) is a variant of the Gradient
Descent algorithm that is used for optimizing machine learning models. It
addresses the computational inefficiency of traditional Gradient Descent
methods when dealing with large datasets in machine learning projects.

In SGD, instead of using the entire dataset for each iteration, only a single
random training example (or a small batch) is selected to calculate the
gradient and update the model parameters. This random selection introduces
randomness into the optimization process, hence the term “stochastic” in
stochastic Gradient Descent
The advantage of using SGD is its computational efficiency, especially when
dealing with large datasets. By using a single example or a small batch, the
computational cost per iteration is significantly reduced compared to
traditional Gradient Descent methods that require processing the entire
dataset.
Stochastic Gradient Descent Algorithm
 Initialization: Randomly initialize the parameters of the model.
 Set Parameters: Determine the number of iterations and the learning rate
(alpha) for updating the parameters.
 Stochastic Gradient Descent Loop: Repeat the following steps until the
model converges or reaches the maximum number of iterations:
a. Shuffle the training dataset to introduce randomness.
b. Iterate over each training example (or a small batch) in the
shuffled order.
c. Compute the gradient of the cost function with respect to the
model parameters using the current training example (or
batch).
d. Update the model parameters by taking a step in the direction
of the negative gradient, scaled by the learning rate.
e. Evaluate the convergence criteria, such as the change in the
cost function between successive iterations.
 Return Optimized Parameters: Once the convergence criteria are met or
the maximum number of iterations is reached, return the optimized model
parameters. A minimal code sketch of this loop is shown below.
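The following is a minimal NumPy sketch of the loop above, applied to linear regression with a squared-error cost (the data, learning rate, and epoch count are illustrative assumptions):

import numpy as np

def sgd(X, y, lr=0.01, epochs=50):
    # Plain SGD for linear regression: one training example per update.
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])                  # initialize the parameters
    for _ in range(epochs):
        idx = rng.permutation(len(X))         # a. shuffle the training set
        for i in idx:                         # b. iterate example by example
            grad = (X[i] @ w - y[i]) * X[i]   # c. gradient of 0.5*(x.w - y)^2
            w -= lr * grad                    # d. step along the negative gradient
    return w

# usage on toy data generated from y = 2*x0 - 3*x1 (the weights are approximately recovered)
X = np.random.default_rng(1).standard_normal((200, 2))
y = X @ np.array([2.0, -3.0])
print(sgd(X, y))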
Stochastic gradient descent (SGD) with momentum

The momentum algorithm introduces a variable v that plays the role of velocity: it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton's laws of motion. Momentum in physics is mass times velocity; in the momentum learning algorithm, we assume unit mass, so the velocity vector v may also be regarded as the momentum of the particle. With momentum, the velocity is accumulated and then added to the parameters as:

v ← α v - ε ∇θ J(θ)
θ ← θ + v

where α ∈ [0, 1) controls how quickly the contributions of previous gradients decay and ε is the learning rate.

SGD is generally noisier than typical Gradient Descent and usually takes a higher number of iterations to reach the minima because of the randomness in its updates. Even so, each update is far less computationally expensive than in typical Gradient Descent.
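A minimal NumPy sketch of the momentum updates above, assuming a function grad(theta) that returns the gradient of the cost at theta (the quadratic cost and hyperparameter values are illustrative):

import numpy as np

def sgd_momentum(theta, grad, lr=0.01, alpha=0.9, steps=1000):
    v = np.zeros_like(theta)                   # velocity starts at rest
    for _ in range(steps):
        v = alpha * v - lr * grad(theta)       # exponentially decaying average of the negative gradient
        theta = theta + v                      # move the "particle" by its velocity
    return theta

# usage: minimize the toy quadratic J(theta) = 0.5 * theta' A theta
A = np.diag([10.0, 1.0])
print(sgd_momentum(np.array([5.0, 5.0]), lambda t: A @ t))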

Parameter Initialization Strategies

Training algorithms for deep learning models are iterative in nature


and require the specification of an initial point. This is extremely
crucial as it often decides whether or not the algorithm converges
and if it does, then does the algorithm converge to a point with high
cost or low cost.

We have limited understanding of neural network optimization but


the one property that we know with complete certainty is that the
initialization should break symmetry. This means that if two
hidden units are connected to the same input units, then these
should have different initialization or else the gradient would update
both the units in the same way and we don’t learn anything new by
using an additional unit. The idea of having each unit learn
something different motivates random initialization of weights
which is also computationally cheaper.

Biases are often chosen heuristically (zero mostly) and only the
weights are randomly initialized, almost always from a Gaussian or
uniform distribution. The scale of the distribution is of utmost
concern. Large weights might have better symmetry-breaking effect
but might lead to chaos (extreme sensitivity to small perturbations
in the input) and exploding values during forward & back
propagation. As an example of how large weights might lead to exploding values, consider that a small perturbation ϵ in the input would add a factor of roughly W * ϵ to the output. When the weights are large, this ends up making a significant contribution to the output.
SGD and its variants tend to halt in areas near the initial values,
thereby expressing a prior that the path to the final parameters from
the initial values is discoverable by steepest descent algorithms.

Various suggestions have been made for appropriate initialization of


the parameters. The most commonly used ones include sampling the
weights of each fully-connected layer having m inputs and n outputs
uniformly from the following distributions:

 U(-1 / √m, 1 / √m)

 U(- √(6 / (m+n)), √(6 / (m+n)))

U(a, b) represents the uniform distribution, whose probability density is 1/(b-a) for each value between a and b (inclusive) and 0 for every other value.

These initializations have already been incorporated into the most commonly used Deep Learning frameworks nowadays, so that you can just specify which initializer to use and the framework takes care of sampling appropriately. For example, Keras, which is a very popular deep learning framework, has a module called initializers, where the second of the distributions above is available as glorot_uniform.
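As an illustration, assuming TensorFlow/Keras (the layer sizes are arbitrary), the initializer is simply named per layer:

import tensorflow as tf

model = tf.keras.Sequential([
    # weights drawn from U(-sqrt(6/(m+n)), sqrt(6/(m+n))), with m inputs and n outputs
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_initializer="glorot_uniform",
                          bias_initializer="zeros",
                          input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax",
                          kernel_initializer="glorot_uniform"),
])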

One drawback of using 1 / √m as the standard deviation is that the


weights end up being small when a layer has too many input/output
units. Motivated by the idea to have the total amount of input to
each unit independent of the number of input units m, Sparse
initialization sets each unit to have exactly k non-zero weights.
However, it takes a long time for GD to correct incorrect large
values and hence, this initialization might cause problems.

If the weights are too small, the range of activations across the mini-
batch will shrink as the activations propagate forward through the
network. By repeatedly identifying the first layer with unacceptably
small activations and increasing its weights, it is possible to
eventually obtain a network with reasonable initial activations
throughout.

The biases are relatively easier to choose. Setting the biases to zero is
compatible with most weight initialization schemes except for a few
cases .

Algorithms with Adaptive Learning Rates

 AdaGrad: it is important to incrementally decrease the


learning rate for faster convergence. Instead of manually
reducing the learning rate after each (or several) epochs, a
better approach is to adapt the learning rate as the training
progresses. This can be done by scaling the learning rates
to the square root of the sum of historical squared values of
the gradient. In the parameter update equation below, r is
initialized with 0 and the multiplication in the update step
happens element-wise as mentioned. Since the gradient value
would be different for each parameter, the learning rate is
scaled differently for each parameter too.

 Thus, parameters having a large gradient have a large decrease in their learning rate: for them the learning rate might be too high, leading to oscillations, or it might cause the update to jump over the minima while approaching them (as explained in the figure below), so the learning rate should be decreased for better convergence. Parameters with small gradients have only a small decrease in their learning rate: they might already have approached their respective minima and should not be pushed away from them, and even if they have not, reducing their learning rate too much would shrink the updates even further, leading to slower learning.

AdaGrad parameter update equation:

r ← r + g ⊙ g
θ ← θ - (ε / (δ + √r)) ⊙ g

where g is the gradient, ε is the global learning rate, and δ is a small constant (e.g. 10⁻⁷) added for numerical stability.


This figure illustrates the need to reduce the learning rate if gradient is
large in case of a single parameter. 1) One step of gradient descent
representing a large gradient value. 2) Result of reducing the learning rate
— moves towards the minima 3) Scenario if the learning rate was not
reduced — it would have jumped over the minima.

However, accumulation of squared gradients from the very


beginning can lead to excessive and premature decrease in the
learning rate. Consider that we had a model with only 2 parameters
(for simplicity) and both the initial gradients are 1000.

After some iterations, the gradient of one of the parameters has reduced to 100, but that of the other parameter is still around 750. However, because of the accumulation at each update, the accumulated gradients would still have almost the same value. For example, let the accumulated gradients at each step for Parameter 1 be 1000 + 900 + 700 + 400 + 100 = 3100 (1/3100 ≈ 0.0003), and those for Parameter 2 be 1000 + 900 + 850 + 800 + 750 = 4300 (1/4300 ≈ 0.0002). This would lead to a similar decrease in the learning rates for both parameters, even though the parameter having the lower gradient might have its learning rate reduced too much, leading to slower learning.

Figure: the problem with AdaGrad. Accumulated gradients can cause the learning rate to be reduced far too much in the later stages, leading to slower learning.
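A minimal NumPy sketch of the AdaGrad update, with grad(theta) again standing in for a function that returns the gradient (hyperparameters are illustrative):

import numpy as np

def adagrad(theta, grad, lr=0.1, delta=1e-7, steps=1000):
    r = np.zeros_like(theta)                            # accumulated squared gradients
    for _ in range(steps):
        g = grad(theta)
        r += g * g                                      # accumulate element-wise squares
        theta = theta - lr / (delta + np.sqrt(r)) * g   # per-parameter scaled step
    return theta

A = np.diag([10.0, 1.0])
print(adagrad(np.array([5.0, 5.0]), lambda t: A @ t))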

 RMSProp: RMSProp addresses the problem caused by


accumulated gradients in AdaGrad. It modifies the gradient
accumulation step to an exponentially weighted moving
average in order to discard history from the extreme past. The
RMSProp update is given by:

r ← ρ r + (1 - ρ) g ⊙ g
θ ← θ - (ε / √(δ + r)) ⊙ g

ρ is the weight used for the exponential averaging. As more updates are made, the contribution of past gradient values is reduced, since ρ < 1 and ρ > ρ² > ρ³ …

This allows the algorithm to converge rapidly after finding a convex bowl, as if it were an instance of AdaGrad initialized within that bowl. Consider the figure below. The region represented by 1 indicates usual RMSProp parameter updates as given by the update equation, which are nothing but exponentially averaged AdaGrad updates. Once the optimization process lands on A, it essentially lands at the top of a convex bowl. At this point, intuitively, all the updates before A can be seen to be forgotten due to the exponential averaging, and it can be seen as if (exponentially averaged) AdaGrad updates start from point A onwards. Inside the convex bowl, the exponentially weighted averaging causes the effect of earlier gradients to shrink and, to simplify, we can assume their contribution to be zero. This can be seen as if AdaGrad had been used with the training initiated inside the convex bowl.
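A minimal NumPy sketch of the RMSProp update under the same assumptions as before (grad(theta) supplied by the user, illustrative hyperparameters):

import numpy as np

def rmsprop(theta, grad, lr=0.01, rho=0.9, delta=1e-6, steps=1000):
    r = np.zeros_like(theta)                 # exponentially weighted average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        r = rho * r + (1 - rho) * g * g      # history from the extreme past decays away
        theta = theta - lr / np.sqrt(delta + r) * g
    return theta

A = np.diag([10.0, 1.0])
print(rmsprop(np.array([5.0, 5.0]), lambda t: A @ t))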

 Adam: Adapted from “adaptive moments”, it focuses on


combining RMSProp and Momentum. Firstly, it views
Momentum as an estimate of the first-order moment and
RMSProp as that of the second moment. The weight update
for Adam is given by:

s ← ρ1 s + (1 - ρ1) g (first moment, as in momentum)
r ← ρ2 r + (1 - ρ2) g ⊙ g (second moment, as in RMSProp)
ŝ = s / (1 - ρ1^t),  r̂ = r / (1 - ρ2^t) (bias correction at step t)
θ ← θ - ε ŝ / (√r̂ + δ)

Secondly, since s and r are initialized as zeros, the authors observed


a bias during the initial steps of training thereby adding a correction
term for both the moments to account for their initialization near
the origin. As an example of what the effect of this bias correction is,
we’ll look at the values of s and r for a single parameter (in which
case everything is now represented as a scalar). Let’s first
understand what would happen if there was no bias correction.
Since s (notice that this is not in bold as we are looking at the value
for a single parameter and the s here is a scalar) is initialized as zero,
after the first iteration, the value of s would be (1 - ρ1) * g and that of r would be (1 - ρ2) * g², with ρ1 and ρ2 typically set to 0.9 and 0.999 respectively. Thus, the initial values of s and r are pretty small, and this gets compounded as the training progresses.
However, if we now use bias correction, after the first iteration, the
value of s is just g and that of r is just g². This gets rid of the bias
that occurs in the initial phase of training. A major advantage of
Adam is that it’s fairly robust to the choice of these
hyperparameters, i.e. ρ1 and ρ2.
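A minimal NumPy sketch of Adam with bias correction (grad(theta) assumed as before; the hyperparameters shown are the commonly used defaults):

import numpy as np

def adam(theta, grad, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8, steps=5000):
    s = np.zeros_like(theta)                 # first-moment (momentum) estimate
    r = np.zeros_like(theta)                 # second-moment (RMSProp) estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        s = rho1 * s + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g * g
        s_hat = s / (1 - rho1 ** t)          # correct the bias toward zero
        r_hat = r / (1 - rho2 ** t)
        theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta

A = np.diag([10.0, 1.0])
print(adam(np.array([5.0, 5.0]), lambda t: A @ t))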

3. Approximate Second-Order Methods

The optimization algorithms that we’ve looked at till now involved


computing only the first derivative. But there are many methods
which involve higher order derivatives as well. The main problem
with these algorithms are that they are not practically feasible in
their vanilla form and so, certain methods are used to approximate
the values of the derivatives. We explain three such methods, all of
which use empirical risk as the objective function:

 Newton’s Method: This is the most common higher-order


derivative method used. It makes use of the curvature of the
loss function via its second-order derivative to arrive at the
optimal point. Using the second-order Taylor Series
expansion to approximate J(θ) around a point θo and ignoring
derivatives of order greater than 2 (this has already been
discussed in previous chapters), we get:

J(θ) ≈ J(θo) + (θ - θo)' ∇θ J(θo) + ½ (θ - θo)' H (θ - θo)

where H is the Hessian of J with respect to θ evaluated at θo. We know that we get a critical point for any function f(x) by solving for f'(x) = 0. Solving the above approximation in the same way gives the following critical point (refer to the Appendix for proof):

θ* = θo - H⁻¹ ∇θ J(θo)

For quadratic surfaces (i.e. where cost function is quadratic), this


directly gives the optimal result in one step whereas gradient
descent would still need to iterate. However, for surfaces that are not
quadratic, as long as the Hessian remains positive definite, we can
obtain the optimal point through a 2-step iterative process — 1) Get
the inverse of the Hessian and 2) update the parameters.

Saddle points are problematic for Newton’s method. If all the


eigenvalues are not positive, Newton’s method might cause the
updates to move in the wrong direction. A way to avoid this is to add regularization to the Hessian:

θ* = θo - [H + αI]⁻¹ ∇θ J(θo)

However, if there is a strong negative curvature, i.e. the eigenvalues are largely negative, α needs to be sufficiently high to offset the negative eigenvalues, in which case the Hessian becomes dominated by the αI term and the update effectively becomes the standard gradient divided by α:

θ* ≈ θo - (1/α) ∇θ J(θo)

Another problem restricting the use of Newton’s method is the


computational cost. It takes O(k³) time to calculate the inverse of the
Hessian where k is the number of parameters. It’s not uncommon
for Deep Neural Networks to have about a million parameters and
since the parameters are updated every iteration, this inverse needs
to be calculated at every iteration, which is not computationally
feasible.
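A minimal NumPy sketch of a single regularized Newton step on a toy quadratic cost (the cost, its gradient and Hessian, and the damping value α are illustrative assumptions; for a real network they would come from the model):

import numpy as np

# toy quadratic cost J(theta) = 0.5 * theta' A theta - b' theta
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def gradient(theta):
    return A @ theta - b

def hessian(theta):
    return A                                   # constant Hessian for a quadratic

theta = np.zeros(2)
alpha = 1e-3                                   # regularization keeps H + alpha*I positive definite
H_reg = hessian(theta) + alpha * np.eye(2)
theta = theta - np.linalg.solve(H_reg, gradient(theta))   # one Newton step
print(theta)                                   # essentially the exact minimizer for a quadratic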

 Conjugate Gradients: One weakness of the method of


steepest descent (i.e. GD) is that line searches happen along
the direction of the gradient. Suppose the previous search
direction is d(t-1). Once the search terminates (which it does
when the gradient along the current gradient direction
vanishes) at the minimum, the next search direction, d(t) is
given by the gradient at that point, which is orthogonal to d(t-
1) (because if it’s not orthogonal, it’ll have some component
along d(t-1) which cannot be true as at the minimum, the
gradient along d(t-1) has vanished).
Upon getting the minimum along the current search direction, the next line search proceeds along an orthogonal search direction.

In the method of conjugate gradients, we seek a search direction that


is conjugate to the previous line search direction:

d(t) = ∇θ J(θ) + βt d(t-1)

Now, the previous search direction contributes towards finding the next search direction.

with d(t) and d(t-1) being conjugates if d(t)' H d(t-1) = 0. βt


decides how much of d(t-1) is added back to the current search
direction. There are two popular choices for βt — Fletcher-Reeves
and Polak-Ribière. These discussions assumed the cost function to be quadratic, where the conjugate directions ensure that the gradient along the previous direction does not increase in magnitude. To extend the concept to training neural networks, one additional change is made: since the cost function is no longer quadratic, there is no guarantee that the conjugate direction preserves the minimum along the previous directions, so the algorithm is periodically restarted with a line search along the unaltered gradient.

 BFGS: This algorithm tries to bring the advantages of


Newton’s method without the additional computational
burden by approximating the inverse of H by M(t), which is
iteratively refined using low-rank updates. Finally, line search
is conducted along the direction M(t)g(t). However, BFGS
requires storing the matrix M(t) which takes O(n²) memory
making it infeasible. An approach called Limited Memory
BFGS (L-BFGS) has been proposed to tackle this infeasibility
by computing the matrix M(t) using the same method as
BFGS but assuming that M(t−1) is the identity matrix.

4. Optimization Strategies and Meta-Algorithms

 Batch Normalization: Batch normalization (BN) is one of


the most exciting innovations in Deep learning that has
significantly stabilized the learning process and allowed faster
convergence rates. The intuition behind batch normalization
is as follows: Most of the Deep Learning networks are
compositions of many layers (or functions) and the gradient
with respect to one layer is taken considering the other layers
to be constant. However, in practise all the layers are updated
simultaneously and this can lead to unexpected results. For
example, let y* = x W¹ W² … W¹⁰. Here, y* is a linear function
of x but not a linear function of the weights. Suppose the
gradient is given by g and we now intend to reduce y* by 0.1. Using the first-order Taylor Series approximation, we would take a step of -ϵ g, with ϵ chosen so that ϵ g' g = 0.1, reducing y* by 0.1 just using the first-order information. However, higher-order effects also creep in, as the updated y* is given by:

y* = x (W¹ - ϵ g1)(W² - ϵ g2) … (W¹⁰ - ϵ g10)

An example of a second-order term would be ϵ² g1 g2 ∏ wi. ∏ wi


can be negligibly small or exponentially high depending on whether
the individual weights are less than or greater than 1. Since the
updates to one layer is so strongly dependent on the other layers,
choosing an appropriate learning rate is tough. Batch normalization
takes care of this problem by using an efficient reparameterization of
almost any deep network. Given a matrix of activations, H, the
normalization is given by: H’ = (H-μ) / σ, where the subtraction
and division is broadcasted.

A small constant δ is added under the square root when computing σ (i.e. σ = √(δ + variance)) to ensure that σ is never 0.

Going back to the earlier example of y*, let the activations of


layer l be given by h(l-1). Then h(l-1) = x W1 W2 … W (l-1). Now, if
x is drawn from a unit Gaussian, then h(l-1) also comes from a
Gaussian, however, not of zero mean and unit variance, as it is a
linear transformation of x. BN makes it zero mean and unit variance, so the layers above no longer have to adapt to shifts caused by the lower layers; in this linear example the lower layers no longer have any harmful effect. This simplicity was definitely achieved by rendering the lower layers useless. However, in a realistic deep network with non-linearities, the lower layers remain useful. Finally, the complete
reparameterization of BN is given by replacing H with γH’ + β. This
is done to retain its expressive power; in the new parameterization, the mean is solely determined by β. Also, among the choice of normalizing X or XW + B, the authors recommend the latter, specifically XW, since B becomes redundant because of β.
Practically, this means that when we are using the Batch
Normalization layer, the biases should be turned off. In a deep
learning framework like Keras, this can be done by setting the
parameter use_bias=False in the Convolutional layer.
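As a small Keras illustration (layer sizes arbitrary), the bias of the layer feeding Batch Normalization is switched off, since the learned β takes over its role:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, use_bias=False, input_shape=(28, 28, 1)),  # bias off: beta replaces it
    tf.keras.layers.BatchNormalization(),    # normalizes activations, then applies gamma*H' + beta
    tf.keras.layers.ReLU(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])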

 Coordinate Descent: Generally, a single weight update is


made by taking the gradient with respect to every parameter.
However, in cases where some of the parameters might be
independent (discussed below) of the remaining, it might be
more efficient to take the gradient with respect to those
independent sets of parameters separately for making
updates. Let me clarify that with an example. Suppose we have
the following cost function:

J(H, W) = Σ_{i,j} |H_{i,j}| + Σ_{i,j} ( X - W' H )²_{i,j}

This cost function describes the learning problem called sparse coding. Here, H refers to the sparse representation of X and W is the matrix of weights (the dictionary) used to linearly decode H and reconstruct X. An explanation of why this cost function enforces the learning of a sparse representation of X follows. The first term of the cost function penalizes values far from 0 (positive or negative, because of the modulus |H| operator). This forces most of the values to be 0, thereby sparse. The second term is pretty self-explanatory in that it penalizes the difference between X and H linearly transformed by W, thereby enforcing them to take approximately the same value.
In this way, H is now learned as a sparse “representation” of X. The
cost function generally consists of additionally a regularization term
like weight decay, which has been avoided for simplicity. Here, we
can divide the entire list of parameters into two sets, W and H.
Minimizing the cost function with respect to any of these sets of
parameters is a convex problem. Coordinate Descent (CD) refers
to minimizing the cost function with respect to only 1 parameter at a
time. It has been shown that repeatedly cycling through all the
parameters, we are guaranteed to arrive at a local minima. If instead
of 1 parameter, we take a set of parameters as we did before
with W and H, it is called block coordinate descent (the
interested reader should explore Alternating Minimization). CD
makes sense if either the parameters are clearly separable into
independent groups or if optimizing with respect to certain set of
parameters is more efficient than with respect to others.
The points A, B, C and D indicate the locations in the parameter space where coordinate descent landed after each gradient step.

Coordinate descent may fail terribly when one variable influences


the optimal value of another variable.
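A rough NumPy sketch of block coordinate descent for the sparse coding cost above, alternating a soft-thresholded (proximal) gradient update of H with a least-squares update of W (shapes, step sizes, and iteration counts are illustrative assumptions, not a definitive algorithm):

import numpy as np

def soft_threshold(Z, t):
    # proximal operator of the |Z| penalty: shrink every entry toward zero by t
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_coding_bcd(X, k=5, lam=0.1, outer_iters=20, inner_iters=50, seed=0):
    # block coordinate descent on J(H, W) = lam * sum|H| + sum (X - W'H)^2
    rng = np.random.default_rng(seed)
    n_features, n_examples = X.shape
    W = rng.standard_normal((k, n_features))
    H = np.zeros((k, n_examples))
    for _ in range(outer_iters):
        # minimize over H with W fixed (convex; proximal gradient steps)
        step = 0.5 / (np.linalg.norm(W, 2) ** 2 + 1e-8)
        for _ in range(inner_iters):
            H = soft_threshold(H + 2 * step * W @ (X - W.T @ H), lam * step)
        # minimize over W with H fixed (convex; ordinary least squares)
        W = np.linalg.lstsq(H.T, X.T, rcond=None)[0]
    return H, W

X = np.random.default_rng(1).standard_normal((8, 100))
H, W = sparse_coding_bcd(X)
print(np.mean(H == 0.0))   # fraction of exactly-zero entries in the learned sparse code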

 Polyak Averaging: Polyak averaging consists of averaging


several points in the parameter space that the optimization
algorithm traverses through. So, if the algorithm encounters
the points θ(1), θ(2), …, θ(t) during optimization, the output of Polyak averaging is their average:

θ̂(t) = (1/t) Σ_{i=1..t} θ(i)

The figure below explains the intuition behind Polyak averaging:


The optimization algorithm might oscillate back and forth across a valley
without ever reaching the minima. However, the average of those points
should be closer to the bottom of the valley.

Most optimization problems in deep learning are non-convex where


the path taken by the optimization algorithm is quite complicated
and it might happen that a point visited in the distant past might be
quite far from the current point in the parameter space. Thus,
including such a point in the distant past might not be useful, which
is why an exponentially decaying running average is taken. This
scheme, where the recent iterates are weighted more than the past ones, is called Polyak-Ruppert Averaging:

θ̂(t) = α θ̂(t-1) + (1 - α) θ(t)
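A minimal sketch of this exponentially weighted running average of the iterates (α and the noisy iterate sequence below are illustrative; in practice the iterates come from whatever optimizer is being run):

import numpy as np

def polyak_ruppert_average(iterates, alpha=0.9):
    # exponentially weighted running average of optimizer iterates
    avg = iterates[0].copy()
    for theta in iterates[1:]:
        avg = alpha * avg + (1 - alpha) * theta   # recent iterates weigh more
    return avg

# usage: average a noisy sequence oscillating around the optimum [1, 2]
rng = np.random.default_rng(0)
iterates = [np.array([1.0, 2.0]) + 0.5 * rng.standard_normal(2) for _ in range(200)]
print(polyak_ruppert_average(iterates))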

 Supervised Pre-training: Sometimes it's hard to directly train a model to solve a specific task. Instead, it might be better to first train a simpler model (or train on a simpler, related task) and then use the learned parameters as a starting point for training to solve the more challenging task.

Applications: Large-Scale Deep Learning : Computer Vision, Speech Recognition, Natural


Language Processing

Common Applications of Deep Learning

Deep learning has many uses across many fields, and its potential keeps growing. Let's look at a few of the most widespread applications of deep learning in artificial intelligence.

 Image Recognition and Computer Vision


 Natural Language Processing (NLP)
 Speech Recognition and Voice Assistants
 Recommendation Systems
 Autonomous Vehicles
 Healthcare and Medical Imaging
 Fraud Detection and Cybersecurity
 Gaming and Virtual Reality

Image Recognition and Computer Vision


The performance of image recognition and computer vision tasks has
significantly improved due to deep learning. Computers can now reliably
classify and comprehend images owing to training deep neural networks on
enormous datasets, opening up a wide range of applications.

A smartphone app that can instantaneously determine a dog’s breed from a


photo and self-driving cars that employ computer vision algorithms to
detect pedestrians, traffic signs, and other roadblocks for safe navigation
are two examples of this in practice.

Deep Learning Models for Image Classification

The process of classifying photos entails giving them labels based on the
content of the images. Convolutional neural networks (CNNs), one type of
deep learning model, have performed exceptionally well in this context.
They can categorize objects, situations, or even specific properties within
an image by learning to recognize patterns and features in visual
representations.
Object Detection and Localization using Deep Learning

Object detection and localization go beyond image categorization by


identifying and locating various things inside an image. Deep learning
methods have recognized and localized objects in real-time, such as You
Only Look Once (YOLO) and region-based convolutional neural networks
(R-CNNs). This has uses in robotics, autonomous cars, and surveillance
systems, among other areas.
Facial Recognition and Biometrics

Deep learning has completely changed the field of facial recognition, allowing people to be identified precisely from their facial features. Security systems, access control, monitoring, and law enforcement all use facial recognition technology. Deep learning methods have also been applied in biometrics for functions including voice recognition, iris scanning, and fingerprint recognition.

Natural Language Processing (NLP)

Natural language processing (NLP) aims to make it possible for computers to comprehend, translate, and create human language. NLP has advanced substantially, primarily due to deep learning, making strides in several language-related activities. Virtual voice assistants like Apple's Siri and Amazon's Alexa, which can comprehend spoken orders and questions, are a practical illustration of this.

Deep Learning for Text Classification and Sentiment


Analysis

Text classification entails classifying text materials into several groups or


divisions. Deep learning models like recurrent neural networks
(RNNs) and long short-term memory (LSTM) networks have been
frequently used for text categorization tasks. To ascertain the sentiment or
opinion expressed in a text, whether good, negative, or neutral, sentiment
analysis is a widespread use of text categorization.
Machine Translation and Text Generation Using Deep Learning

Machine translation systems have considerably improved because of deep


learning. Deep learning-based neural machine translation (NMT) models
have been shown to perform better when converting text across multiple
languages. These algorithms can gather contextual data and generate
more precise and fluid translations. Deep learning models have also been
applied to creating news stories, poetry, and other types of text, including
coherent paragraphs.

Question Answering and Chatbot Systems Using Deep


Learning

Deep learning is used by chatbots and question-answering programs to


recognize and reply to human inquiries. Transformers and attention
mechanisms, among other deep learning models, have made tremendous
progress in understanding the context and semantics of questions and
producing pertinent answers. Information retrieval systems, virtual
assistants, and customer service all use this technology.
Speech Recognition and Voice Assistants

The creation of voice assistants that can comprehend and respond to
human speech and the advancement of speech recognition systems have
significantly benefited from deep learning. A real-world example is using
your smartphone’s voice recognition feature to dictate messages rather
than typing them and asking a smart speaker to play your favorite tunes or
provide the weather forecast.

Deep Learning Models for Automatic Speech


Recognition

Systems for automatic speech recognition (ASR) translate spoken words


into written text. Recurrent neural networks and attention-based models, in
particular, have substantially improved ASR accuracy. Better voice
commands, transcription services, and accessibility tools for those with
speech difficulties are the outcome. Some examples are voice search
features in search engines like Google, Bing, etc.

Voice Assistants Powered by Deep Learning Algorithms

Daily, we rely heavily on voice assistants like Siri, Google Assistant, and Alexa to handle our spoken requests. The technology also enables voice assistants to
recognize speech, decipher user intent, and deliver precise and pertinent
responses thanks to deep learning models.

Applications in Transcription and Voice-Controlled


Systems

Deep learning-based speech recognition has applications in transcription


services, where large volumes of audio content must be accurately
converted into text. Voice-controlled systems, such as smart homes and in-
car infotainment systems, utilize deep learning algorithms to enable hands-
free control and interaction through voice commands.

Recommendation Systems

Recommendation systems use deep learning algorithms to offer people


personalized recommendations based on their tastes and behavior.
A standard method used in recommendation systems to suggest
products/services to users based on how they are similar to other users is
collaborative filtering. Collaborative filtering has improved accuracy and
performance thanks to deep learning models like
matrix factorization and deep autoencoders, which have produced more
precise and individualized recommendations.

Personalized Recommendations Using Deep Neural


Networks

Deep neural networks have been used to identify intricate links and
patterns in user behavior data, allowing for more precise and individualized
suggestions. Deep learning algorithms can forecast user preferences and
make relevant product, movie, or content recommendations by looking at
user interactions, purchase history, and demographic data. An instance of
this is when streaming services recommend films or TV shows based on
your interests and history.

Applications in E-Commerce and Content Streaming


Platforms

Deep learning algorithms are widely employed to fuel recommendation


systems in e-commerce platforms and video streaming services
like Netflix and Spotify. These programs increase user pleasure and
engagement by assisting users in finding new goods, entertainment, or
music that suits their tastes and preferences.
Autonomous Vehicles

Deep learning has significantly impacted how well autonomous vehicles
can understand and navigate their surroundings. These vehicles can
analyze enormous volumes of sensor data in real-time using powerful deep
learning algorithms. Thus, enabling them to make wise decisions, navigate
challenging routes, and guarantee the safety of passengers and
pedestrians. This game-changing technology has prepared the path for a
time when driverless vehicles will completely change how we travel.

Deep Learning Algorithms for Object Detection and


Tracking

Autonomous vehicles must perform crucial tasks, including object


identification and tracking, to recognize and monitor objects like
pedestrians, cars, and traffic signals. Convolutional and recurrent neural
networks (CNNs) and other deep learning algorithms have proved essential
in obtaining high accuracy and real-time performance in object detection
and tracking.
Self-Driving Cars

Autonomous vehicles are designed to make complex decisions and


navigate various traffic circumstances using deep reinforcement learning.
This technology is profoundly used in self-driving cars manufactured by
companies like Tesla. These vehicles can learn from historical driving data
and adjust to changing road conditions using deep neural networks. Self-
driving cars demonstrate this in practice, which uses cutting-edge sensors
and artificial intelligence algorithms to navigate traffic, identify impediments,
and make judgments in real time.

Applications in Autonomous Navigation and Safety


Systems

The development of autonomous navigation systems that decipher sensor


data, map routes, and make judgments in real time depends heavily on
deep learning techniques. These systems focus on collision avoidance,
generate lane departure warnings, and offer adaptive cruise control to
enhance the general safety and dependability of the vehicles.
Healthcare and Medical Imaging

Deep learning has shown tremendous potential in revolutionizing
healthcare and medical imaging by assisting in diagnosis, disease
detection, and patient care. Revolutionizing diagnostics using AI-powered
algorithms that can precisely identify early-stage tumors from medical
imaging is an example of how to do this. This will help with prompt
treatment decisions and improve patient outcomes.

Deep Learning for Medical Image Analysis and


Diagnosis

Deep learning algorithms can glean essential insights from the enormous
volumes of data that medical imaging systems produce. Convolutional
neural networks (CNNs) and generative adversarial networks (GANs) are
examples of deep learning algorithms. They can be effectively used for
tasks like tumor identification, radiology image processing, and
histopathology interpretation.

Predictive Models for Disease Detection and Prognosis


Deep learning models can analyze patient data such as electronic health records and medical pictures to create predictive models for disease detection, prognosis, and treatment planning.

Applications in Medical Research and Patient Care

Deep learning can revolutionize medical research by expediting


the development of new drugs, forecasting the results of treatments, and
assisting clinical decision-making. Additionally, deep learning-based
systems can also improve medical care by helping with diagnosis, keeping
track of patients’ vital signs, and making unique suggestions for dietary
changes and preventative actions.

Fraud Detection and Cybersecurity

Deep learning has become essential in detecting anomalies, identifying


fraud patterns, and strengthening cybersecurity systems.
These systems shine when finding anomalies or outliers in large datasets.
By learning from typical patterns, deep learning models may recognize
unexpected behaviors, network intrusions, and fraudulent operations.
These methods are used in network monitoring, cybersecurity systems,
and financial transactions. JP Morgan Chase, PayPal, and other
businesses are just a few that use these techniques.

Deep Neural Networks in Fraud Prevention and


Cybersecurity

In fraud prevention systems, deep neural networks have been used to


recognize and stop fraudulent transactions, credit card fraud, and identity
theft. These algorithms examine user behavior, transaction data, and
historical patterns to spot irregularities and notify security staff. This
enables proactive fraud prevention and shields customers and
organizations from financial loss. Organizations like Visa, Mastercard, and
PayPal use deep neural networks. It helps improve their fraud detection
systems and guarantees secure customer transactions.

Applications in Financial Transactions and Network


Security

Deep learning algorithms are essential for preserving sensitive data,


safeguarding financial transactions, and thwarting online threats. Deep
learning-based cybersecurity systems can proactively identify and reduce
potential hazards, protecting vital data and infrastructure by learning and
adapting to changing attack vectors over time.
Gaming and Virtual Reality

Deep learning has significantly improved game AI, character animation,
and immersive surroundings, benefiting the gaming industry and virtual
reality experiences. A virtual reality game, for instance, can adjust and
customize its gameplay experience based on the player’s real-time motions
and reactions by using deep learning algorithms.

Deep Learning in Game Development and Character


Animation

Deep learning algorithms have produced more intelligent and lifelike video
game characters. Game makers may create realistic animations, enhance
character behaviors, and make more immersive gaming experiences by
training deep neural networks on enormous datasets of motion capture
data.

Deep Reinforcement Learning for Game AI and


Decision-Making

Deep reinforcement learning has changed game AI by letting agents learn strategies, adapt to various game circumstances, and deliver challenging and captivating gameplay.

Applications in Virtual Reality and Augmented Reality


Experiences

Experiences in augmented reality (AR) and virtual reality (VR) have been
improved mainly due to deep learning. Deep neural networks are used by
VR and AR systems to correctly track and identify objects, detect
movements and facial expressions, and build real virtual worlds, enhancing
the immersiveness and interactivity of the user experience.

Conclusion

In artificial intelligence, deep learning has become a powerful technology


that allows robots to learn and make wise decisions. Deep learning in AI
has many uses, from image identification and NLP to cybersecurity and
healthcare. It has substantially improved the capabilities of AI systems,
resulting in innovations across various fields and the disruption of entire
sectors. As one example among the common applications of deep learning in AI, Accenture leverages deep learning within its AI initiatives to enhance data analytics, customer experience, and operational efficiency.
