NNDL Notes
Regularization is a set of techniques that prevent overfitting in neural networks and thus
improve the accuracy of a deep learning model when it faces completely new data from the
problem domain.
Most regularization methods add a penalty term to the loss function. This penalty discourages
the model from becoming too complex or having large parameter values, which helps control
the model's ability to fit noise in the training data.
Regularization methods include L1 and L2 regularization, dropout, early stopping, and more.
By applying regularization, models become more robust and better at making accurate
predictions on unseen data.
Example: An epoch is one complete pass of all the training data through the learning
algorithm, i.e., the total number of iterations needed to cover the entire training dataset
once while training the machine learning model.
A statistical model is said to be overfitted when it does not make accurate predictions on
testing data. When a model is trained on too much detail, it starts learning from the noise
and inaccurate entries in the dataset, which results in high variance. The model then fails to
categorize new data correctly because it has absorbed too many details and too much noise.
Overfitting is often caused by non-parametric and non-linear methods, because these types of
machine learning algorithms have more freedom in building the model from the dataset and can
therefore build unrealistic models. Solutions to avoid overfitting include using a linear
algorithm if we have linear data, or constraining parameters such as the maximal depth if we
are using decision trees.
Parameter Norm Penalties are regularization methods that apply a penalty to the norm of
parameters in the objective function of a neural network.
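For concreteness, the penalized objective can be written as follows (a sketch in standard notation: J is the original objective, Ω the norm penalty on the parameters θ, and α ≥ 0 a hyperparameter weighting the penalty):
\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \, \Omega(\theta)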
Lasso Regression
A regression model that uses the L1 regularization technique is called Lasso (Least Absolute
Shrinkage and Selection Operator) regression. Lasso regression adds the "absolute value of
magnitude" of the coefficients as a penalty term to the loss function (L):
L = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2 + \lambda \sum_{j=1}^{m} |w_j|
where,
m – Number of Features
n – Number of Examples
y_i – Actual Target Value
y_i(hat) – Predicted Target Value
w_j – Coefficient (weight) of the j-th feature
Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge regression.
Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the
loss function (L):
L = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2 + \lambda \sum_{j=1}^{m} w_j^2
L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general cost
function by adding another term known as the regularization term.
In L2, we have:
\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \lVert w \rVert^2
In L1, we have:
\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \lVert w \rVert
Here, \lambda is the regularization parameter and m is the number of training examples.
In this, we penalize the absolute value of the weights. Unlike L2, the weights may be
reduced to zero here. Hence, it is very useful when we are trying to compress our model.
Otherwise, we usually prefer L2 over it.
In Keras, we can directly apply regularization to any layer using the regularizers module. Below,
I have applied an L2 regularizer to a dense layer with 500 neurons and a ReLU activation function.
#creating sequential model
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras import regularizers

model=Sequential()
model.add(Conv2D(filters=16,kernel_size=2,padding="same",activation="relu",input_shape=(50,50,3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32,kernel_size=2,padding="same",activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64,kernel_size=2,padding="same",activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Flatten())
#l2 regularizer on the dense layer
model.add(Dense(500,kernel_regularizer=regularizers.l2(0.01),activation="relu"))
model.add(Dense(2,activation="softmax")) #2 represents output layer neurons
Note: Here the value 0.01 is the value of regularization parameter, i.e., lambda, which we
need to optimize further
Similarly, we can also apply L1 regularization.
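For example (a minimal sketch mirroring the dense layer above), an L1 penalty would be:
model.add(Dense(500, kernel_regularizer=regularizers.l1(0.01), activation="relu"))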
We can construct a generalized Lagrangian function containing the objective function along
with the penalty term, whose strength can be increased or decreased:
\mathcal{L}(\theta, \alpha) = J(\theta; X, y) + \alpha\,\big(\Omega(\theta) - k\big)
Suppose we want \Omega(\theta) < k. Whenever \Omega(\theta) > k, the constraint is violated and
should be penalized heavily, so \alpha should be large in order to push the norm back below k.
Likewise, if \Omega(\theta) < k, the norm shouldn't be reduced any further, so \alpha should be
small. This is similar to the parameter norm penalty regularized objective function, as both
encourage lower values of the norm. Thus, parameter norm penalties can be viewed as imposing a
constraint on the parameters.
A larger \alpha implies a smaller constrained region, as it pushes the norm values really low,
allowing only a small radius, and vice versa. The idea of constraints over penalties is important
for several reasons. Large penalties might cause non-convex optimization algorithms to get stuck
in local minima due to small values of \theta, leading to the formation of so-called dead cells,
as the weights entering and leaving them are too small to have an impact.
Constraints don't force the weights to be near zero; they only confine them to a constrained
region.
Underdetermined problems are problems that have infinitely many solutions. For example, a
logistic regression problem with linearly separable classes that has w as a solution will also
have 2w as a solution, and so on, so some form of regularization is necessary. Many algorithms
require the inversion of X'X, which might be singular; in such a case, we can use a regularized
form instead, since (X'X + αI) is guaranteed to be invertible. Regularization can thus solve
underdetermined problems; the Moore-Penrose pseudoinverse, for example, can be interpreted in
this way.
Many linear models in machine learning, including linear regression, depend on inverting X'X.
This is not possible whenever the data-generating distribution truly has no variance in some
direction, or when no variance is observed in some direction because there are fewer examples
(rows of X) than input features (columns of X). In this case, many forms of regularization
correspond to inverting X'X + αI instead.
Data Augmentation
The simplest way to reduce overfitting is to increase the size of the training data. In machine
learning this is often hard to do, because labeled data is too costly to collect.
But now consider that we are dealing with images. In this case, there are a few ways of
increasing the size of the training data: rotating the image, flipping, scaling, shifting, etc.
For the handwritten digits dataset, for instance, such transformations can be applied to
generate additional training examples.
This technique is known as data augmentation. It usually provides a big leap in improving the
accuracy of the model, and can be considered a mandatory trick for improving our predictions.
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    featurewise_center=False,             # set input mean to 0 over the dataset
    samplewise_center=False,              # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,   # divide each input by its std
    zca_whitening=False,                  # apply ZCA whitening
    rotation_range=10,                    # randomly rotate images (degrees)
    zoom_range=0.1,                       # randomly zoom image
    width_shift_range=0.1,                # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,               # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,                # randomly flip images horizontally
    vertical_flip=False)                  # randomly flip images vertically
datagen.fit(x_train)
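A minimal sketch of how the fitted generator is typically used for training (assuming a compiled model named model and arrays x_train/y_train; older Keras versions used model.fit_generator instead of model.fit):
# stream randomly augmented batches into training
model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=20)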
Dropout
This is one of the most interesting types of regularization techniques. It also produces
very good results and is consequently the most frequently used regularization technique in
the field of deep learning.
To understand dropout, consider a standard fully connected neural network structure.
So what does dropout do? At every iteration, it randomly selects some nodes and removes
them, along with all of their incoming and outgoing connections.
So each iteration has a different set of nodes and this results in a different set of outputs. It
can also be thought of as an ensemble technique in machine learning.
Ensemble models usually perform better than a single model as they capture more
randomness. Similarly, dropout also performs better than a normal neural network model.
The probability with which nodes are dropped is the hyperparameter of the dropout function.
Dropout can be applied to both the hidden layers and the input layer.
Due to these reasons, dropout is usually preferred when we have a large neural network
structure in order to introduce more randomness.
In Keras, we can implement dropout using the Dropout layer. Below is the dropout
implementation. I have introduced a dropout rate of 0.2 (the probability of dropping a node)
in my neural network architecture, after the last convolutional layer with 64 filters and
after the first dense layer with 500 neurons.
#creating sequential model
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model=Sequential()
model.add(Conv2D(filters=16,kernel_size=2,padding="same",activation="relu",input_shape=(50,50,3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32,kernel_size=2,padding="same",activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64,kernel_size=2,padding="same",activation="relu"))
model.add(MaxPooling2D(pool_size=2))
# 1st dropout, after the last convolutional block
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(500,activation="relu"))
# 2nd dropout, after the first dense layer
model.add(Dropout(0.2))
model.add(Dense(2,activation="softmax")) #2 represents output layer neurons
Early stopping
Early stopping is a kind of cross-validation strategy where we keep one part of the training
set as the validation set. When we see that the performance on the validation set is getting
worse, we immediately stop training the model. This is known as early stopping.
In a typical training curve, we stop training at the point where the validation error starts
to rise, since after that our model will begin to overfit the training data.
In Keras, we can apply early stopping using a callback. Below is the implementation code for
it. I have applied early stopping so that training stops immediately if the validation metric
does not improve for 3 epochs.
from keras.callbacks import EarlyStopping
earlystop = EarlyStopping(monitor='val_acc', patience=3)
epochs = 20
batch_size = 256
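The callback only takes effect when it is passed to model.fit; a minimal sketch, assuming the compiled model and training arrays from the earlier examples:
model.fit(x_train, y_train,
          batch_size=batch_size, epochs=epochs,
          validation_split=0.2,      # hold out part of the training set for validation
          callbacks=[earlystop])     # stop when val_acc stops improving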
Here, monitor denotes the quantity that needs to be monitored; 'val_acc' denotes the
validation accuracy.
Patience denotes the number of epochs with no further improvement after which the training
will be stopped. In the training curve described above, each epoch after the stopping point
results in a worse validation metric. Therefore, 3 epochs after that point (since our patience
is equal to 3), our model will stop, because no further improvement is seen.
Noise Robustness
Noise applied to the inputs is a form of data augmentation. For some models, the addition of
noise with extremely small variance at the input is equivalent to imposing a penalty on the
norm of the weights.
Noise can also be applied to the hidden units, where noise injection can be much more powerful
than simply shrinking the parameters. Noise applied to hidden units is so important that
dropout can be seen as the main development of this approach.
Training a neural network with a small dataset can cause the network to memorize all
training examples, in turn leading to overfitting and poor performance on a holdout dataset.
One approach to making the input space smoother and easier to learn is to add noise to
inputs during training.
Small datasets can make learning challenging for neural nets and the examples can be
memorized.
Adding noise during training can make the training process more robust and reduce
generalization error.
Noise is traditionally added to the inputs, but can also be added to weights, gradients, and
even activation functions.
Random noise can be added to other parts of the network during training. Some examples include
the following:
The addition of noise to weights allows the approach to be used throughout the network in
a consistent way instead of adding noise to inputs and layer activations. This is particularly
useful in recurrent neural networks.
The addition of noise to gradients focuses more on improving the robustness of the
optimization process itself rather than the structure of the input domain. The amount of
noise can start high at the beginning of training and decrease over time, much like a
decaying learning rate. This approach has proven to be an effective method for very deep
networks and for a variety of different network types
Adding noise to the activations, weights, or gradients all provide a more generic approach to
adding noise that is invariant to the types of input variables provided to the model.
If the problem domain is believed or expected to have mislabeled examples, then the addition of
noise to the class labels can improve the model's robustness to this type of error, although
too much label noise can easily derail the learning process.
Adding noise to a continuous target variable in the case of regression or time series
forecasting is much like the addition of noise to the input variables and may be a better use
case.
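As a minimal Keras sketch of the input-noise idea discussed in this section (the GaussianNoise layer adds zero-mean noise only at training time; the 0.1 standard deviation and the layer sizes are illustrative assumptions):
from keras.models import Sequential
from keras.layers import GaussianNoise, Flatten, Dense

model = Sequential()
model.add(GaussianNoise(0.1, input_shape=(50, 50, 3)))  # add zero-mean Gaussian noise to inputs (training only)
model.add(Flatten())
model.add(Dense(2, activation="softmax"))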
Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that falls in between supervised
and unsupervised learning. It is a method that uses a small amount of labeled data and a
large amount of unlabeled data to train a model. The goal of semi-supervised learning is to
learn a function that can accurately predict the output variable based on the input
variables, similar to supervised learning. However, unlike supervised learning, the
algorithm is trained on a dataset that contains both labeled and unlabeled data.
Semi-supervised learning is particularly useful when there is a large amount of unlabeled
data available, but it’s too expensive or difficult to label all of it.
Multi-Task Learning
Hard Parameter Sharing – A common hidden layer is used for all tasks, but several task-specific
layers are kept towards the end of the model (see the sketch below). This technique is very
useful: by learning a representation for various tasks with a common hidden layer, we reduce
the risk of overfitting.
Soft Parameter Sharing – Each model has its own set of weights and biases, and the distance
between these parameters across the different models is regularized so that the parameters
become similar and can represent all the tasks.
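A minimal sketch of hard parameter sharing with the Keras functional API (the input size, layer widths, and the two task heads are illustrative assumptions):
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(100,))
shared = Dense(64, activation="relu")(inputs)    # common hidden layers shared by all tasks
shared = Dense(64, activation="relu")(shared)
task_a = Dense(1, name="task_a")(shared)                          # task-specific head (regression)
task_b = Dense(10, activation="softmax", name="task_b")(shared)   # task-specific head (classification)
model = Model(inputs=inputs, outputs=[task_a, task_b])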
Parameter Tying
Suppose two models are doing the same classification task (with the same set of classes), but
their input distributions are somewhat different. Model A has parameters w(A) and model B has
parameters w(B); the two models map their inputs to two different but related outputs.
Assume the tasks are comparable enough (possibly with similar input and output distributions)
that the model parameters should be close to each other. We can take advantage of this
information through regularization, by applying a parameter norm penalty of the form
\Omega\big(w^{(A)}, w^{(B)}\big) = \big\lVert w^{(A)} - w^{(B)} \big\rVert_2^2
We used an L2 penalty here, but there are other options.
Parameter Sharing
With this method, the parameters of one model, trained as a classifier in a supervised
paradigm, are regularized to be close to the parameters of another model, trained in an
unsupervised paradigm (to capture the distribution of the observed input data). Thanks to this
design, many of the parameters in the classifier model can be linked with corresponding
parameters in the unsupervised model.
Example : Convolutional neural networks (CNNs) used in computer vision are by far the
most widespread and extensive usage of parameter sharing. Many statistical features of
natural images are translation insensitive. A shot of a cat, for example, can be translated
one pixel to the right and still be a shot of a cat. By sharing parameters across several
picture locations, CNNs take this property into account. Different locations in the input are
computed with the same feature (a hidden unit with the same weights).
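A minimal sketch of explicit parameter sharing in Keras: the same layer object, and hence the same weights, is applied to two inputs, analogous to a convolutional filter reusing its weights at every image location (shapes are illustrative assumptions):
from keras.layers import Input, Dense, concatenate
from keras.models import Model

shared_dense = Dense(32, activation="relu")   # one set of weights...
left = Input(shape=(50,))
right = Input(shape=(50,))
left_features = shared_dense(left)            # ...reused for both inputs
right_features = shared_dense(right)
merged = concatenate([left_features, right_features])
output = Dense(1, activation="sigmoid")(merged)
model = Model(inputs=[left, right], outputs=output)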
Sparse Representations
Sparse representation (SR) is used to represent data with as few atoms as possible in a given
overcomplete dictionary. By using the SR, we can concisely represent the data and easily
extract the valuable information from the data
the terms "sparse" and "dense" are commonly used to describe the distribution of zero
and non-zero array members in machine learning (e.g. vector or matrix). Sparse matrices
are those that primarily consist of zeros, while dense matrices have a large number of
nonzero entries.
Machine learning makes use of sparse and dense representations due to their usefulness in
efficient data representation. While dense representations are useful for capturing intricate
interactions between data points, sparse representations can help minimize the amount of
a dataset.
Sparse matrix representations can be done in many ways; the following are two common
representations:
1. Array representation
2. Linked list representation
Example -
Let's understand the array representation of a sparse matrix with the help of an example.
Consider a 5x4 sparse matrix containing 7 non-zero elements and 13 zero elements. Stored
directly, the matrix occupies 5x4 = 20 memory locations, and increasing the size of the matrix
increases the wasted space.
In the array (triplet) representation, the size of the table depends upon the total number of
non-zero elements in the given sparse matrix. For this example the table occupies 8x3 = 24
memory locations, which is more than the space occupied by the sparse matrix itself. So what's
the benefit of using this representation? Consider the case where the matrix is 8*8 and there
are only 8 non-zero elements in the matrix: the space occupied by the full matrix would be
8*8 = 64, whereas the space occupied by the table represented using triplets would be 8*3 = 24.
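A minimal plain-Python sketch of the triplet (row, column, value) representation; the example matrix here is an illustrative assumption, not the one from the original figure:
# a 5x4 matrix with only a few non-zero elements
matrix = [
    [0, 0, 1, 0],
    [0, 0, 0, 2],
    [0, 3, 0, 0],
    [0, 0, 0, 0],
    [4, 0, 0, 0],
]
# one (row, col, value) triplet per non-zero element
triplets = [(r, c, v) for r, row in enumerate(matrix) for c, v in enumerate(row) if v != 0]
print(triplets)  # [(0, 2, 1), (1, 3, 2), (2, 1, 3), (4, 0, 4)]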
Example -
Let's understand the linked list representation of a sparse matrix with an example.
In this representation, each node of the linked list stores one non-zero element: the first
field holds the row index, the second field holds the column index, the third field holds the
value, and the fourth field contains the address of the next node.
For instance, if the first node of the linked list contains 0 in its first field (0th row),
2 in its second field (2nd column), and 1 in its third field (the non-zero value), then that
node represents that the element 1 is stored at row 0, column 2 of the sparse matrix. In a
similar manner, all of the nodes represent the non-zero elements of the sparse matrix.
A sparse code follows the more general idea of a neural code. Consider the case where you have
binary neurons. Basically:
The neural network receives some inputs and delivers outputs.
Some neurons in the network will be frequently activated while others won't be activated at
all when computing the outputs.
The average activity ratio refers to the fraction of neurons that are activated on some data,
whereas the neural code is the observation of those activations for a specific input.
Neural coding is the process of instructing your neurons to produce a reliable neural code.
Now that we know what a neural code is, a sparse code is one in which only a small fraction of
the neurons are activated for any given input.
Bagging (Bootstrap Aggregating)
Bagging, or Bootstrap Aggregating, is an ensemble learning method that is used to reduce error
by training homogeneous weak learners on different random samples from the training set, in
parallel. The results of these base learners are then combined through a voting or averaging
approach to produce an ensemble model that is more robust and accurate.
Bagging mainly focuses on obtaining an ensemble model with lower variance than the individual
base models composing it. Hence, bagging techniques help avoid overfitting of the model.
Benefits of Bagging
Reduce Overfitting
Improve Accuracy
Handles Unstable Models
Note: Random Forest Algorithm is one of the most common Bagging Algorithm.
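A minimal sketch of bagging with scikit-learn (assuming scikit-learn is available; the synthetic dataset and the number of estimators are illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# 50 base learners (decision trees by default), each fit on a bootstrap sample,
# with predictions combined by voting
bagging = BaggingClassifier(n_estimators=50, random_state=0)
bagging.fit(X, y)
print(bagging.score(X, y))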
Image classification
Suppose a training dataset of animal images is available, where the label, i.e., the kind of
animal, is given for the purpose of training a model. In a traditional modeling approach, we
would try several techniques and calculate the accuracy to choose one over the other. Imagine
we used logistic regression, a decision tree, and a support vector machine.
Suppose a specific record is predicted as a dog by the logistic regression and decision tree
models, while the support vector machine identifies it as a cat. As various models have their
distinct advantages and disadvantages for particular records, the key idea of ensemble learning
is to combine all three models instead of relying on only one.
The procedure is called aggregation or voting: it combines the predictions of all underlying
models to come up with one prediction that is assumed to be more precise than any individual
sub-model.
Boosting
Boosting trains homogeneous weak learners sequentially, with each new learner focusing on the
errors of the previous ones. Benefits of boosting include:
Adaptive Learning
Reduces Bias
Flexibility
Note: XGBoost is one of the most common boosting algorithms.
How is Boosting Model Trained to Make Predictions
Samples generated from the training set are assigned the same weight to start with.
These samples are used to train a homogeneous weak learner or base model.
The prediction error for a sample is calculated – the greater the error, the more the weight
of the sample increases. Hence, the sample becomes more important for training the next base
model.
The individual learner is weighted too – a learner that does well on its predictions gets a
higher weight assigned to it. So, a model that outputs good predictions will have a higher say
in the final decision.
The weighted data is then passed on to the following base model, and steps 2 and step 3
are repeated until the data is fitted well enough to reduce the error below a certain
threshold.
When new data is fed into the boosting model, it is passed through all individual base
models, and each model makes its own weighted prediction.
Weight of these models is used to generate the final prediction. The predictions are
scaled and aggregated to produce a final prediction.
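The weighting scheme described above is essentially what AdaBoost does; a minimal scikit-learn sketch (the dataset and settings are illustrative assumptions):
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# weak learners are trained sequentially; misclassified samples get larger weights
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X, y)
print(boosting.score(X, y))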
Key Difference Between Bagging and Boosting
The bagging technique combines multiple models trained on different subsets of data,
whereas boosting trains models sequentially, focusing on the error made by the previous
model.
Bagging is best for high variance and low bias models while boosting is effective when the
model must be adaptive to errors, suitable for bias and variance errors.
Generally, boosting techniques are not prone to overfitting. Still, it can be if the number
of models or iterations is high, whereas the Bagging technique is less prone to overfitting.
Bagging improves accuracy mainly by reducing variance, whereas boosting achieves accuracy by
reducing both bias and variance. Hence, boosting is suitable when both bias and variance need
to be addressed, while bagging is suitable for high-variance, low-bias models.
Bias: While making predictions, a difference occurs between the values predicted by the model
and the actual/expected values; this difference is known as bias error, or error due to bias.
o Low Bias: A low-bias model makes fewer assumptions about the form of the target function.
o High Bias: A model with high bias makes more assumptions and becomes unable to capture the
important features of our dataset. A high-bias model also cannot perform well on new data.
Variance: The variance specifies the amount of variation in the prediction if different
training data were used. In simple words, variance tells how much a random variable differs
from its expected value. Ideally, a model should not vary too much from one training dataset
to another, which means the algorithm should be good at capturing the hidden mapping between
input and output variables. Variance errors are either low variance or high variance.
o Low variance means there is a small variation in the prediction of the target function with
changes in the training dataset. High variance shows a large variation in the prediction of
the target function with changes in the training dataset.
Tangent Propagation
Tangent propagation (TANGENTPROP) uses prior knowledge in the form of desired derivatives of
the target function with respect to transformations of its inputs. It combines this prior
knowledge with observed training data by minimizing an objective function that measures both
the network's error with respect to the training example values (fitting the data) and its
error with respect to the desired derivatives (fitting the prior knowledge).
Tangent propagation is closely related to dataset augmentation. In both cases, the user of
the algorithm encodes his or her prior knowledge of the task by specifying a set of
transformations that should not alter the output of the network.
The difference is that in the case of dataset augmentation, the network is explicitly trained
to correctly classify distinct inputs that were created by applying more than an infinitesimal
amount of these transformations.
Tangent propagation does not require explicitly visiting a new input point. Instead, it
analytically regularizes the model to resist perturbation in the directions corresponding to
the specified transformations. While this analytical approach is intellectually elegant, it has
two major drawbacks. First, it only regularizes the model to resist infinitesimal perturbation;
explicit dataset augmentation confers resistance to larger perturbations (i.e., larger changes
of the inputs). Second, the infinitesimal approach poses difficulties for models based on
rectified linear units. These models can only shrink their derivatives by turning units off or
shrinking their weights.
They are not able to shrink their derivatives by saturating at a high value with large weights,
as sigmoid or tanh units can. Dataset augmentation works well with rectified linear units
because different subsets of rectified units can activate for different transformed versions of
each original input. Tangent propagation is also related to double backprop (Drucker and
LeCun, 1992) and adversarial training
The TANGENTPROP algorithm assumes that various training derivatives of the target function are
also provided. For example, if each instance x_i is described by a single real value, then each
training example may be of the form \big(x_i, f(x_i), \frac{\partial f}{\partial x}\big|_{x=x_i}\big).
Here \frac{\partial f}{\partial x}\big|_{x=x_i} denotes the derivative of the target function f
with respect to x, evaluated at the point x = x_i.
To develop an intuition for the benefits of providing training derivatives as well as training
values during learning, consider the simple learning task depicted in Figure
The task is to learn the target function f shown in the leftmost plot of the figure, based on
the three training examples shown: (x_1, f(x_1)), (x_2, f(x_2)), and (x_3, f(x_3)).
Given these three training examples, the BACKPROPAGATION algorithm can be expected to
hypothesize a smooth function, such as the function g depicted in the middle plot of the
figure. The rightmost plot shows the effect of providing training derivatives, or slopes, as
additional information for each training example, e.g., \big(x_1, f(x_1),
\frac{\partial f}{\partial x}\big|_{x=x_1}\big). By fitting both the training values f(x_i) and
these training derivatives \frac{\partial f}{\partial x}\big|_{x=x_i}, the learner has a better
chance to correctly generalize from the sparse training data.
To summarize, the impact of including the training derivatives is to override the usual
syntactic inductive bias of BACKPROPAGATION that favors a smooth interpolation between
points, replacing it by explicit input information about required derivatives. The resulting
hypothesis h shown in the rightmost plot of the figure provides a much more accurate
estimate of the true target function f.
Each transformation must be of the form s_j(\alpha, x), where \alpha is a continuous parameter,
s_j is differentiable, and s_j(0, x) = x (e.g., for rotation of zero degrees the transformation
is the identity function).
In the figure described above, f(x) is the target function and x_1, x_2, x_3 are the training
instances; the middle plot shows the smooth hypothesis fitted to these instances alone, while
the rightmost plot shows the hypothesis obtained when the training derivatives are also fitted.
For each such transformation s_j(\alpha, x), TANGENTPROP considers the squared error between
the specified training derivative and the actual derivative of the learned neural network. The
modified error function is
E = \sum_i \Big[ \big(f(x_i) - \hat{f}(x_i)\big)^2 + \mu \sum_j \Big( \frac{\partial \hat{f}(s_j(\alpha, x_i))}{\partial \alpha}\Big|_{\alpha=0} - \frac{\partial f(s_j(\alpha, x_i))}{\partial \alpha}\Big|_{\alpha=0} \Big)^2 \Big]
where \mu is a constant provided by the user to determine the relative importance of fitting
training values versus fitting training derivatives.
Notice that the first term in this definition of E is the original squared error of the network
versus the training values, and the second term is the squared error in the network's
derivatives versus the training derivatives.
In the rightmost plot, the training instances are fitted properly and accuracy is maintained.
Remarks: To summarize, TANGENTPROP uses prior knowledge in the form of desired
derivatives of the target function with respect to transformations of its inputs.
It combines this prior knowledge with observed training data, by minimizing an objective
function that measures both the network's error with respect to the training example values
(fitting the data) and its error with respect to the desired derivatives (fitting the prior
knowledge).
UNIT - V
Optimization for Training Deep Models: Challenges in Neural Network Optimization, Basic
Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates,
Approximate Second-Order Methods, Optimization Strategies and Meta-Algorithms
There are several types of optimization used in deep learning algorithms, but the most
interesting and widely used ones are based on gradient descent.
The core of deep learning optimization relies on trying to minimize the cost function of a
model without hurting its generalization. That type of optimization problem contrasts with the
general optimization problem, in which the objective is simply to minimize a function for its
own sake; in deep learning we minimize a cost defined on the training data while really caring
about performance on unseen data.
Most optimization algorithms in deep learning are based on gradient estimations. In that
context, optimization algorithms try to reduce the gradient of specific cost functions.
Algorithms that use the entire training set at once are called deterministic (batch) methods.
Techniques that use one training example at a time have come to be known as online algorithms.
Similarly, algorithms that use more than one but fewer than all of the training examples during
each update are usually called minibatch (stochastic) methods. The most famous method of
stochastic optimization, which is also the most common in deep learning, is stochastic gradient
descent (SGD).
Regardless of the type of optimization algorithm used, optimizing a deep learning model is
difficult in practice.
There are plenty of challenges in deep learning optimization but most of them are related to
the nature of the gradient of the model. Below, I’ve listed some of the most common
challenges in deep learning optimization that you are likely to run into:
a) Local Minima: Local minima are a permanent challenge in the optimization of any deep
learning algorithm. The local minima problem arises when the gradient encounters many local
minima that are different from, and not correlated with, the global minimum of the cost
function.
b) Saddle Points: Saddle points are another reason for gradients to vanish. A saddle point is
any location where all gradients of a function vanish but which is neither a global nor a local
minimum.
c) Flat Regions: In deep learning optimization models, flat regions are common areas that
represent both a local minimum for one sub-region and a local maximum for another. Because the
gradient is close to zero there, optimizers can get stuck in these regions.
d) Local vs. Global Structure: Another very common challenge in the optimization of deep
learning models is that local regions of the cost function do not correspond to its global
structure, so a direction that looks promising locally may not lead toward the global optimum.
Solution: Gradient clipping, advanced weight initialization, and skip connections help keep
gradients well behaved so that training proceeds accurately and consistently.
Overfitting
Overfitting happens when a model knows too much about the training data, so it can't make
good predictions about new data. As a result, the model performs well on the training data
but struggles to make accurate predictions on new, unseen data. It's essential to address
overfitting by employing techniques like regularization, cross-validation, and more diverse
datasets to ensure the model generalizes well to unseen examples.
Regularisation techniques help ensure that our models do not simply memorize the training data
but instead use what they've learned to make good predictions about new data. Techniques like
dropout, L1/L2 regularisation, and early stopping can help us do this.
Data Quality
Noisy, incomplete, or poorly prepared data limits what the model can learn during training;
clean, well-prepared data enables it to learn more effectively and make accurate predictions.
Solution: Apply data augmentation techniques like rotation, translation, and flipping,
alongside data normalization and proper handling of missing values.
Label Noise
Training labels are sometimes incorrect or noisy, making it hard for models to learn well.
Solution: Using noise-robust loss functions can help ensure that the model is not strongly
affected by label mistakes.
Imbalanced Datasets
Datasets can have too many examples of one class and too few of another. This can cause models
to perform poorly on the under-represented classes.
Solution: When classes are uneven, we can use techniques like class weighting, oversampling, or
data synthesis to ensure that all classes are adequately represented.
Computational Resource Constraints
Training deep neural networks can be very difficult and take a lot of computer power,
especially if the model is very big.
Solution: Using multiple computers or special chips called GPUs and TPUs can help make
learning faster and easier.
Hyperparameter Tuning
Deep neural networks have numerous hyperparameters that require careful tuning to
achieve optimal performance.
Solution: Utilize automated hyperparameter optimization methods, such as Bayesian optimization
or genetic algorithms, to efficiently find the best hyperparameters.
Convergence Speed
It is important to ensure a model works quickly when using lots of data and complicated
designs.
Solution: Adopt learning rate scheduling or adaptive algorithms like Adam or RMSprop to
expedite convergence.
Memory Constraints
Training large models on large datasets requires a lot of memory, and training can fail or slow
down when memory is insufficient.
Solution: Reduce memory usage by applying model quantization, using mixed-precision
training, or employing memory-efficient architectures like MobileNet or EfficientNet.
Transfer Learning and Domain Adaptation
Deep learning networks need lots of data to work well. If they don't get enough data or the
data is different, they won't work as well.
Solution: Leverage transfer learning or domain adaptation techniques to transfer knowledge
from pre-trained models or related domains.
Adversarial Attacks
Deep neural networks can be fooled by small, carefully crafted perturbations of their inputs
that are imperceptible to humans, causing them to give wrong answers.
Interpretability and Explainability
Understanding the decisions made by deep neural networks is crucial in critical applications
like healthcare and autonomous driving.
Solution: Adopt techniques such as LIME (Local Interpretable Model-Agnostic Explanations)
or SHAP (SHapley Additive exPlanations) to explain model predictions.
Handling Sequential Data
Training deep neural networks on sequential data, such as time series or natural language
sequences, presents unique challenges.
Solution: Utilize specialized architectures like recurrent neural networks (RNNs) or
transformers to handle sequential data effectively.
Limited Data
Training deep neural networks with limited labeled data is a common challenge, especially
in specialized domains.
Solution: Consider semi-supervised, transfer, or active learning to make the most of
available data.
Catastrophic Forgetting
When a model forgets previously learned knowledge after training on new data, it
encounters the issue of catastrophic forgetting.
Solution: Implement techniques like elastic weight consolidation (EWC) or knowledge
distillation to retain old knowledge during continual learning.
Hardware and Deployment Constraints
Deploying trained models on devices with limited computing power can be hard.
Solution: Techniques such as model compression, pruning, and quantization help models run
efficiently on resource-constrained devices.
Data Privacy and Security
When training computers to do complex tasks, it is essential to keep data private and ensure
the computers are secure.
Solution: Employ federated learning, secure aggregation, or differential privacy techniques
to protect data and model privacy.
Long Training Times
Training deep neural networks is like doing a challenging puzzle. It takes a lot of time to
assemble the puzzle, especially if it is vast and has a lot of pieces.
Solution: Special tools like GPUs or TPUs can help us train our computers faster. We can also
try using different computers simultaneously to make the training even quicker.
Exploding Memory Usage
Some models are too big and need a lot of space, so they are hard to use on regular
computers.
Solution: Explore memory-efficient architectures, use gradient checkpointing, or consider
model parallelism for training.
Learning Rate Scheduling
Setting an appropriate learning rate schedule can be challenging, affecting model
convergence and performance.
Solution: Learning rate schedules such as step decay, exponential decay, or cosine annealing
with warm restarts can make training converge more smoothly and quickly.
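A minimal sketch of a learning rate schedule (assuming tf.keras, where schedules live under keras.optimizers.schedules; the decay values are illustrative assumptions):
from tensorflow import keras

# multiply the learning rate by 0.96 every 1000 optimizer steps
schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.96)
optimizer = keras.optimizers.SGD(learning_rate=schedule)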
Local Minima and Plateaus
The optimizer can get stuck in local minima or flat plateaus, which can hurt the final
performance.
Solution: Strategies like simulated annealing, momentum-based optimization, and evolutionary
algorithms can help the optimizer escape these difficult spots.
Unstable Loss Surfaces
Finding the best parameters can be very hard when the loss surface is complicated and bumpy
(highly non-convex), with many competing directions to move in.
Solution: Utilize weight noise injection, curvature-based optimization, or geometric methods to
stabilize loss surfaces.
Ill-Conditioned Matrix
In a neural network, the weight adjustments of the hidden layers are computed in matrix form.
The condition number of such a matrix characterizes how it behaves in further computations:
formally, it is a measure of how much the output value of a function can change for a small
change in the input argument.
A matrix is said to be ill-conditioned if its condition number is very high. In that case, a
small change in the input function or in the Hessian matrix (the Hessian is a square matrix of
second-order partial derivatives of a scalar function; it is of immense use in linear algebra
as well as for determining points of local maxima or minima) leads to a disproportionately
large change in the outputs, which makes gradient-based optimization unstable.
Basic Algorithms
In SGD, instead of using the entire dataset for each iteration, only a single
random training example (or a small batch) is selected to calculate the
gradient and update the model parameters. This random selection introduces
randomness into the optimization process, hence the term “stochastic” in
stochastic Gradient Descent
The advantage of using SGD is its computational efficiency, especially when
dealing with large datasets. By using a single example or a small batch, the
computational cost per iteration is significantly reduced compared to
traditional Gradient Descent methods that require processing the entire
dataset.
Stochastic Gradient Descent Algorithm
Initialization: Randomly initialize the parameters of the model.
Set Parameters: Determine the number of iterations and the learning rate
(alpha) for updating the parameters.
Stochastic Gradient Descent Loop: Repeat the following steps until the
model converges or reaches the maximum number of iterations:
a. Shuffle the training dataset to introduce randomness.
b. Iterate over each training example (or a small batch) in the
shuffled order.
c. Compute the gradient of the cost function with respect to the
model parameters using the current training example (or
batch).
d. Update the model parameters by taking a step in the direction
of the negative gradient, scaled by the learning rate.
e. Evaluate the convergence criteria, such as the difference in the
cost function between iterations of the gradient.
Return Optimized Parameters: Once the convergence criteria are met or
the maximum number of iterations is reached, return the optimized model
parameters.
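A minimal NumPy sketch of this loop for one-variable linear regression with squared error (the toy data, learning rate, and epoch count are illustrative assumptions):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))                      # toy inputs
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # targets: y = 3x + noise

w, b = 0.0, 0.0            # 1. initialize parameters
alpha = 0.01               # 2. learning rate
for epoch in range(20):    # 3. SGD loop
    for i in rng.permutation(len(X)):             # a/b. shuffle, then visit one example at a time
        error = (w * X[i, 0] + b) - y[i]
        grad_w, grad_b = error * X[i, 0], error   # c. gradient of 0.5*error^2
        w -= alpha * grad_w                       # d. step against the gradient
        b -= alpha * grad_b
print(w, b)                # 4. optimized parameters, roughly (3, 0)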
Stochastic gradient descent (SGD) with momentum
The momentum algorithm introduces a variable v that plays the role of velocity: it is the
direction and speed at which the parameters move through parameter space. The velocity is set
to an exponentially decaying average of the negative gradient. The name momentum derives from a
physical analogy, in which the negative gradient is a force moving a particle through parameter
space according to Newton's laws of motion. Momentum in physics is mass times velocity; in the
momentum learning algorithm we assume unit mass, so the velocity vector v may also be regarded
as the momentum of the particle. The update rules with momentum are
v \leftarrow \alpha v - \epsilon \nabla_\theta \Big( \frac{1}{m} \sum_{i=1}^{m} L\big(f(x^{(i)}; \theta), y^{(i)}\big) \Big), \qquad \theta \leftarrow \theta + v
where \alpha \in [0, 1) controls how quickly the contributions of previous gradients decay and
\epsilon is the learning rate.
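In Keras, momentum is exposed directly on the SGD optimizer; a minimal sketch, assuming a model like the ones built earlier (the argument is named lr in some older Keras versions):
from keras.optimizers import SGD

# velocity update v <- momentum*v - learning_rate*gradient, then theta <- theta + v
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])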
Parameter Initialization Strategies
Biases are often chosen heuristically (mostly zero) and only the weights are randomly
initialized, almost always from a Gaussian or uniform distribution. The scale of the
distribution is of utmost concern. Large weights might have a better symmetry-breaking effect
but might lead to chaos (extreme sensitivity to small perturbations in the input) and exploding
values during forward and back propagation. As an example of how large weights might lead to
chaos, a small perturbation ϵ in the input would add a factor of roughly W * ϵ to the output;
when the weights are large, this ends up making a significant contribution to the output.
SGD and its variants tend to halt in areas near the initial values, thereby expressing a prior
that the path to the final parameters from the initial values is discoverable by steepest
descent algorithms.
U(a, b) represents the uniform distribution, where the probability density of each value
between a and b (a and b inclusive) is 1/(b-a); the density of every other value is 0.
If the weights are too small, the range of activations across the mini-
batch will shrink as the activations propagate forward through the
network.By repeatedly identifying the first layer with unacceptably
small activations and increasing its weights, it is possible to
eventually obtain a network with reasonable initial activations
throughout.
The biases are relatively easier to choose. Setting the biases to zero is
compatible with most weight initialization schemes except for a few
cases .
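A minimal Keras sketch of setting the weight distribution and zero biases explicitly (Glorot/Xavier uniform is shown as one common scaled-uniform scheme; the layer size is illustrative):
from keras.layers import Dense

# weights drawn from a Glorot/Xavier-scaled uniform distribution, biases initialized to zero
layer = Dense(500, activation="relu",
              kernel_initializer="glorot_uniform",
              bias_initializer="zeros")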
In momentum-based and adaptive-learning-rate methods, the previous search direction contributes
towards finding the next search direction.
Applications of Deep Learning
Deep learning has many uses in many fields, and its potential continues to grow. Let's look at
a few widespread applications of deep learning in artificial intelligence.
Image Classification
The process of classifying photos entails giving them labels based on the content of the
images. Convolutional neural networks (CNNs), one type of deep learning model, have performed
exceptionally well in this context. They can categorize objects, scenes, or even specific
properties within an image by learning to recognize patterns and features in visual
representations.
Object Detection and Localization using Deep Learning
Speech Recognition and Voice Assistants
Daily, we rely heavily on voice assistants like Siri and Google Assistant to understand and
respond to our spoken requests. Deep learning models enable these voice assistants to recognize
speech, decipher user intent, and deliver precise and pertinent responses.
Recommendation Systems
Deep neural networks have been used to identify intricate links and
patterns in user behavior data, allowing for more precise and individualized
suggestions. Deep learning algorithms can forecast user preferences and
make relevant product, movie, or content recommendations by looking at
user interactions, purchase history, and demographic data. An instance of
this is when streaming services recommend films or TV shows based on
your interests and history.
Healthcare and Medical Imaging
Deep learning algorithms can glean essential insights from the enormous volumes of data that
medical imaging systems produce. Convolutional neural networks (CNNs) and generative
adversarial networks (GANs) are examples of deep learning models that can be effectively used
for tasks like tumor identification, radiology image processing, and histopathology
interpretation.
Gaming
Deep learning algorithms have produced more intelligent and lifelike video game characters.
Game makers can create realistic animations, enhance character behaviors, and build more
immersive gaming experiences by training deep neural networks on enormous datasets of motion
capture data.
Augmented and Virtual Reality
Experiences in augmented reality (AR) and virtual reality (VR) have improved largely due to
deep learning. VR and AR systems use deep neural networks to correctly track and identify
objects, detect movements and facial expressions, and build realistic virtual worlds,
enhancing the immersiveness and interactivity of the user experience.
Conclusion